
Course 4: Exploratory Data Analysis

Exploratory Analysis

Exploratory analysis is a phase of data science used to understand the nature of the data and the relationships among them. It therefore consists of walking through the different datasets and applying graphical methods to answer these questions. For this we will use three tools: R's base graphics, the lattice system, and ggplot2.

The principles of analytic graphics are the following:

  1. Show comparisons. Always ask: compared to what?
  2. Show causality, mechanism, explanation, and systematic structure.
  3. Show multivariate data (more than 2 variables; we need to escape "flatland").
  4. Integrate the evidence (don't let the tool drive the analysis; combine words, numbers, images, etc.).
  5. Describe and document the evidence with appropriate labels, scales, sources, etc.
  6. Content is king. Analytical presentations succeed or fail depending on the quality, relevance, and integrity of their content.

An exploratory analysis helps us in the following ways:

  • Understand the properties of the data
  • Find patterns in the data
  • Suggest modeling strategies
  • "Debug" the analysis
  • Communicate results

The characteristics of exploratory graphs are the following:

  • They are made quickly
  • A large number are made
  • The goal is personal understanding
  • Axes/legends are generally cleaned up later
  • Color/size are primarily used to convey information

Below we show how to summarize data along one dimension, two dimensions, and N dimensions:
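The original screenshots are lost; this is a minimal sketch of typical one-, two- and N-dimensional summaries, illustrated here with the built-in airquality dataset (an assumption — the course examples may use other data):

```r
data(airquality)

# One dimension: five-number summary, boxplot, histogram
summary(airquality$Ozone)
boxplot(airquality$Ozone, col = "blue")
hist(airquality$Ozone, col = "green", breaks = 25)

# Two dimensions: scatterplot, or boxplots split by a factor
with(airquality, plot(Wind, Ozone))
boxplot(Ozone ~ Month, data = airquality, col = "red")

# More than two dimensions: map extra variables to color, or use panels
with(airquality, plot(Wind, Ozone, col = Month))
```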


Base plotting
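The screenshots for this section are missing; as a minimal sketch (again assuming the built-in airquality data), base plots are built with an initial plot() call plus annotation functions layered on top:

```r
data(airquality)
with(airquality, plot(Wind, Ozone, main = "Ozone and Wind in New York City"))
model <- lm(Ozone ~ Wind, airquality)   # fit a regression...
abline(model, lwd = 2)                  # ...and overlay the fitted line
legend("topright", pch = 1, legend = "ozone")
```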


Lattice system

Use it when you want to look at a lot of data graphically at once, or view the information by groups and subgroups.
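A minimal sketch of the lattice style, assuming the built-in airquality data: the factor after `|` conditions the plot, producing one panel per group:

```r
library(lattice)
# One panel per month, laid out in a single row of 5 panels
xyplot(Ozone ~ Wind | factor(Month), data = airquality, layout = c(5, 1))
```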


ggplot2

qplot -> it is always passed a dataset

Plots are built with aesthetics (color, size, shape) and with geoms (points, lines, bars).
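For example, the simplest qplot call names the two variables and the data frame (mpg is the example dataset bundled with ggplot2, used throughout these notes):

```r
library(ggplot2)
qplot(displ, hwy, data = mpg)   # the dataset is always supplied via data =
```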


Its base function is ggplot, which is very flexible and can be used when qplot falls short.

Basic plot
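A sketch of the basic ggplot call: ggplot() binds the data and the aesthetic mappings, and layers are then added with `+`:

```r
library(ggplot2)
g <- ggplot(mpg, aes(displ, hwy))   # data + aesthetic mappings, nothing drawn yet
g + geom_point()                    # adding a geom draws the plot
```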


Add the smoothed line and the 95% confidence interval of that line
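A sketch with the mpg data:

```r
library(ggplot2)
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth()   # smoothed line; the gray band is the 95% confidence interval
```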


Add colors by one dimension
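A sketch with the mpg data, mapping one variable to color:

```r
library(ggplot2)
ggplot(mpg, aes(displ, hwy, color = drv)) +   # one color per drive type
  geom_point()
```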


Make separate panels per dimension, instead of using color
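A sketch with the mpg data, using facets instead of a color aesthetic:

```r
library(ggplot2)
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_grid(. ~ drv)   # one panel per drive type instead of colors
```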


If we supply only one axis we get a histogram
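A sketch with the mpg data:

```r
library(ggplot2)
ggplot(mpg, aes(hwy)) +          # only one variable mapped...
  geom_histogram(binwidth = 2)   # ...so we get a histogram
```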


Another type of plot, a mix between a histogram and a boxplot, is the violin plot, which shows within a boxplot-like layout which values are repeated most often:
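A sketch with the mpg data:

```r
library(ggplot2)
ggplot(mpg, aes(drv, hwy)) +
  geom_violin()   # the width of each "violin" shows where values concentrate
```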


CASE STUDY


eno -> indicates whether there is inflammation in the lungs


View the relationship between 2 variables: eno vs pm25
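The course's MAACS case-study data are not bundled with R; assuming a data frame maacs with columns eno and pm25 (names taken from the course), the plot would look like:

```r
library(ggplot2)
# maacs, eno and pm25 are assumptions from the course case study
qplot(log(pm25), log(eno), data = maacs)
```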


We can also draw their smoothed curves


The same, but with facets


We add BMI to the study, to see whether its value has any bearing on the relationship between being asthmatic and pm25.

In a first plot we see that something happens with the overweight group: the regression curve slopes upward:


Since the plot is built in layers, we can try, for example, the different types of smoothing available. The default, LOESS, produces a very noisy confidence interval here because there are not many data points at the extremes and LOESS flexes a lot with the data, so we will use a linear model instead:
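A sketch of swapping the smoother (maacs and its columns are again assumptions from the course case study):

```r
library(ggplot2)
ggplot(maacs, aes(log(pm25), log(eno))) +
  geom_point() +
  facet_grid(. ~ bmicat) +       # one panel per BMI category
  geom_smooth(method = "lm")     # linear fit instead of the noisy default LOESS
```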


To add color to the points: if there are many of them, it is a good idea to give the points some transparency
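A sketch with the mpg data:

```r
library(ggplot2)
ggplot(mpg, aes(displ, hwy)) +
  geom_point(alpha = 1/3)   # alpha < 1 makes overplotted regions visible
```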


With se = FALSE in geom_smooth we remove the confidence interval


Changing the axis limits is one place where ggplot and base graphics do differ. An outlier was intentionally introduced that adds a lot of noise when visualizing the data; with base graphics restricting the limits is simple:


Be careful: if we do the same in ggplot (with xlim/ylim), we are removing from the dataset the values outside those limits; instead, the restriction must be applied only to the coordinate axes:
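A sketch of the difference, illustrated with the mpg data: ylim() drops out-of-range rows (so any smoother ignores them), while coord_cartesian() keeps all the data and only zooms the view:

```r
library(ggplot2)
g <- ggplot(mpg, aes(displ, hwy)) + geom_point()
g + ylim(0, 30)                        # DROPS rows with hwy outside 0-30
g + coord_cartesian(ylim = c(0, 30))   # keeps all rows, only restricts the axes
```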


WE MAKE THE CASE MORE COMPLICATED BY INTRODUCING NOCTURNAL SYMPTOMS AND NO2, WHICH IS A CONTINUOUS VARIABLE

The problem with continuous variables is that they are difficult to categorize for display in plots; one way around this is to use tertiles (low, medium, high)
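A sketch of cutting a continuous variable into tertiles with quantile() and cut() (the no2 vector here is simulated stand-in data, not the course's NO2 measurements):

```r
set.seed(1)
no2 <- rnorm(100, mean = 20, sd = 5)                        # stand-in values
cutpoints <- quantile(no2, seq(0, 1, length = 4), na.rm = TRUE)
no2.tert <- cut(no2, cutpoints, include.lowest = TRUE)      # 3 intervals
table(no2.tert)   # counts in the low / medium / high tertiles
```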


EXERCISES

LATTICE

xyplot(Ozone ~ Wind | as.factor(Month), data = airquality, layout=c(5,1))


Since Month is a named column of the airquality dataframe we had to tell R to treat it as a factor. To see how this affects the plot, rerun the xyplot command you just ran, but use Ozone ~ Wind | Month instead of Ozone ~ Wind | as.factor(Month) as the first argument.

xyplot(Ozone ~ Wind | Month, data = airquality, layout=c(5,1))


Not as informative, right? The word Month in each panel really doesn’t tell you much if it doesn’t identify which month it’s plotting. Notice that the actual data is the same between the two plots, though.


Lattice functions behave differently from base graphics functions in one critical way. Recall that base graphics functions plot data directly to the graphics device (e.g., screen, or file such as a PDF file). In contrast, lattice graphics functions return an object of class trellis.


The print methods for lattice functions actually do the work of plotting the data on the graphics device. They return “plot objects” that can be stored (but it’s usually better to just save the code and data). On the command line, trellis objects are auto-printed so that it appears the function is plotting the data.
To see this, create a variable p which is assigned the output of this simple call to xyplot, xyplot(Ozone~Wind,data=airquality).

p <- xyplot(Ozone~Wind,data=airquality)


Nothing plotted, right? But the object p is around.


Type p or print(p) now to see it.

print(p)

Like magic, it appears. Now run the R command names with p as its argument.

names(p)
 [1] "formula"           "as.table"          "aspect.fill"       "legend"
 [5] "panel"             "page"              "layout"            "skip"
 [9] "strip"             "strip.left"        "xscale.components" "yscale.components"
[13] "axis"              "xlab"              "ylab"              "xlab.default"
[17] "ylab.default"      "xlab.top"          "ylab.right"        "main"
[21] "sub"               "x.between"         "y.between"         "par.settings"
[25] "plot.args"         "lattice.options"   "par.strip.text"    "index.cond"
[29] "perm.cond"         "condlevels"        "call"              "x.scales"
[33] "y.scales"          "panel.args.common" "panel.args"        "packet.sizes"
[37] "x.limits"          "y.limits"          "x.used.at"         "y.used.at"
[41] "x.num.limit"       "y.num.limit"       "aspect.ratio"      "prepanel.default"
[45] "prepanel"


We see that the trellis object p has 45 named properties, the first of which is “formula” which isn’t too surprising. A lot of these properties are probably NULL in value. We’ve done some behind-the-scenes work for you and created two vectors. The first, mynames, is a character vector of the names in p. The second is a boolean vector, myfull, which has TRUE values for nonnull entries of p. Run mynames[myfull] to see which entries of p are not NULL.

mynames[myfull]
 [1] "formula"           "as.table"          "aspect.fill"       "panel"
 [5] "skip"              "strip"             "strip.left"        "xscale.components"
 [9] "yscale.components" "axis"              "xlab"              "ylab"
[13] "xlab.default"      "ylab.default"      "x.between"         "y.between"
[17] "index.cond"        "perm.cond"         "condlevels"        "call"
[21] "x.scales"          "y.scales"          "panel.args.common" "panel.args"
[25] "packet.sizes"      "x.limits"          "y.limits"          "aspect.ratio"
[29] "prepanel.default"


Wow! 29 non-NULL values for one little plot. Note that a lot of them are like the ones we saw in the base plotting system. Let's look at the values of some of them. Type p[["formula"]] now.

p[["formula"]]
Ozone ~ Wind


Not surprising, is it? It’s a familiar formula. Now look at p’s x.limits. Remember the double square brackets and quotes.

p[["x.limits"]]
[1]  0.37 22.03


They match the plot, right? The x values are indeed between .37 and 22.03.


Again, not surprising. Before we wrap up, let’s talk about lattice’s panel functions which control what happens inside each panel of the plot. The ease of making multi-panel plots makes lattice very appealing. The lattice package comes with default panel functions, but you can customize what happens in each panel.


Panel functions receive the x and y coordinates of the data points in their panel (along with any optional arguments). To see this, we’ve created some data for you - two 100-long vectors, x and y. For its first 50 values y is a function of x, for the last 50 values, y is random. We’ve also defined a 100-long factor vector f which distinguishes between the first and last 50 elements of the two vectors. Run the R command table with f as its argument.
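A sketch of what that exercise sets up and of a custom panel function (the data here is simulated to match the lesson's description; the actual swirl vectors may differ):

```r
library(lattice)
set.seed(1)
x <- rnorm(100)
f <- factor(rep(0:1, each = 50), labels = c("Group 1", "Group 2"))
y <- c(2 * x[1:50] + rnorm(50), rnorm(50))   # first half depends on x, rest random
table(f)

# Custom panel function: draw the default points, then add a median line
xyplot(y ~ x | f, panel = function(x, y, ...) {
    panel.xyplot(x, y, ...)                  # default behavior: plot the points
    panel.abline(h = median(y), lty = 2)     # horizontal median line per panel
})
```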

COLORS

So we’ll first discuss some functions that the grDevices package offers. The function colors() lists the names of 657 predefined colors you can use in any plotting function. These names are returned as strings. Run the R command sample with colors() as its first argument and 10 as its second to give you an idea of the choices you have.

sample(colors(), 10)
 [1] "lightyellow2"   "ivory2"         "palevioletred1" "grey100"        "gray76"
 [6] "coral"          "mediumpurple2"  "rosybrown4"     "grey94"         "chartreuse"


We see a lot of variety in the colors, some of which are names followed by numbers indicating that there are multiple forms of that particular color.


So you’re free to use any of these 600+ colors listed by the colors function. However, two additional functions from grDevices, colorRamp and colorRampPalette, give you more options. Both of these take color names as arguments and use them as “palettes”, that is, these argument colors are blended in different proportions to form new colors.


The first, colorRamp, takes a palette of colors (the arguments) and returns a function that takes values between 0 and 1 as arguments. The 0 and 1 correspond to the extremes of the color palette. Arguments between 0 and 1 return blends of these extremes.


Let’s see what this means. Assign to the variable pal the output of a call to colorRamp with the single argument, c("red","blue").

pal <- colorRamp(c("red","blue"))


We don’t see any output, but R has created the function pal which we can call with a single argument between 0 and 1. Call pal now with the argument 0.

pal(0)
     [,1] [,2] [,3]
[1,]  255    0    0


We see a 1 by 3 array with 255 as the first entry and 0 in the other 2. This 3 long vector corresponds to red, green, blue (RGB) color encoding commonly used in televisions and monitors. In R, 24 bits are used to represent colors. Think of these 24 bits as 3 sets of 8 bits, each of which represents an intensity for one of the colors red, green, and blue.


The 255 returned from the pal(0) call corresponds to the largest possible number represented with 8 bits, so the vector (255,0,0) contains only red (no green or blue), and moreover, it’s the highest possible value of red.


Given that you created pal with the palette containing “red” and “blue”, what color do you think will be represented by the vector that pal(1) returns? Recall that pal will only take arguments between 0 and 1, so 1 is the largest argument you can pass it.

1: yellow 2: green 3: blue 4: red

Selection: 3


Check your answer now by calling pal with the argument 1.

pal(1)
     [,1] [,2] [,3]
[1,]    0    0  255


You see the vector (0,0,255) which represents the highest intensity of blue. What vector do you think the call pal(.5) will return?

1: (255,255,255) 2: (255,0,255) 3: (127.5,0,127.5) 4: (0,255,0)

Selection: 2

That’s not exactly what I’m looking for. Try again.
The correct answer should be halfway between (255,0,0) and (0,0,255). Which is the only choice that is the average (mean) of these two?

1: (127.5,0,127.5) 2: (255,0,255) 3: (255,255,255) 4: (0,255,0)

Selection: 4

Nice try, but that’s not exactly what I was hoping for. Try again.
The correct answer should be halfway between (255,0,0) and (0,0,255). Which is the only choice that is the average (mean) of these two?

1: (127.5,0,127.5) 2: (0,255,0) 3: (255,255,255) 4: (255,0,255)

Selection: 1


The function pal can take more than one argument. It returns one 3-long (or 4-long, but more about this later) vector for each argument. To see this in action, call pal with the argument seq(0,1,len=6).

pal(seq(0, 1, len = 6))
     [,1] [,2] [,3]
[1,]  255    0    0
[2,]  204    0   51
[3,]  153    0  102
[4,]  102    0  153
[5,]   51    0  204
[6,]    0    0  255


Six vectors (each of length 3) are returned. The i-th vector is identical to output that would be returned by the call pal(i/5) for i=0,…5. We see that the i-th row (for i=1,…6) differs from the (i-1)-st row in the following way. Its red entry is 51 = 255/5 points lower and its blue entry is 51 points higher.


So pal creates colors using the palette we specified when we called colorRamp. In this example none of pal’s outputs will ever contain green since it wasn’t in our initial palette.


We’ll turn now to colorRampPalette, a function similar to colorRamp. It also takes a palette of colors and returns a function. This function, however, takes integer arguments (instead of numbers between 0 and 1) and returns a vector of colors each of which is a blend of colors of the original palette.


The argument you pass to the returned function specifies the number of colors you want returned. Each element of the returned vector is a 24 bit number, represented as 6 hexadecimal characters, which range from 0 to F. This set of 6 hex characters represents the intensities of red, green, and blue, 2 characters for each color.


To see this better, assign to the variable p1 the output of a call to colorRampPalette with the single argument, c("red","blue"). We’ll compare it to our experiments using colorRamp.

p1 <- colorRampPalette(c("red","blue"))


Now call p1 with the argument 2.

p1(2)
[1] "#FF0000" "#0000FF"


We see a 2-long vector is returned. The first entry FF0000 represents red. The FF is hexadecimal for 255, the same value returned by our call pal(0). The second entry 0000FF represents blue, also with intensity 255.


Now call p1 with the argument 6. Let’s see if we get the same result as we did when we called pal with the argument seq(0,1,len=6).

p1(6)
[1] "#FF0000" "#CC0033" "#990066" "#650099" "#3200CC" "#0000FF"


Now we get the 6-long vector (FF0000, CC0033, 990066, 650099, 3200CC, 0000FF). We see the two ends (FF0000 and 0000FF) are consistent with the colors red and blue. How about CC0033? Type 0xcc or 0xCC at the command line to see the decimal equivalent of this hex number. You must include the 0 before the x to specify that you’re entering a hexadecimal number.

0xcc
[1] 204


So 0xCC equals 204 and we can easily convert hex 33 to decimal, as in 0x33=3*16+3=51. These were exactly the numbers we got in the second row returned from our call to pal(seq(0,1,len=6)). We see that 4 of the 6 numbers agree with our earlier call to pal. Two of the 6 differ slightly.


We can also form palettes using colors other than red, green and blue. Form a palette, p2, by calling colorRampPalette with the colors “red” and “yellow”. Remember to concatenate them into a single argument.

p1 <- colorRampPalette(c("red","yellow"))

Not exactly. Give it another go. Or, type info() for more options.
Type p2 <- colorRampPalette(c("red","yellow")) at the command prompt.

p2 <- colorRampPalette(c("red","yellow"))


Now call p2 with the argument 2. This will show us the two extremes of the blends of colors we’ll get.

p2(2)
[1] "#FF0000" "#FFFF00"


Not surprisingly the first color we see is FF0000, which we know represents red. The second color returned, FFFF00, must represent yellow, a combination of full intensity red and full intensity green. This makes sense, since yellow falls between red and green on the color wheel as we see here. (We borrowed this image from lucaskrech.com.)


Let’s now call p2 with the argument 10. This will show us how the two extremes, red and yellow, are blended together.

p2(10)
 [1] "#FF0000" "#FF1C00" "#FF3800" "#FF5500" "#FF7100" "#FF8D00" "#FFAA00" "#FFC600" "#FFE200"
[10] "#FFFF00"


So we see the 10-long vector. For each element, the red component is fixed at FF, and the green component grows from 00 (at the first element) to FF (at the last).


This is all fine and dandy but you’re probably wondering when you can see how all these colors show up in a display. We copied some code from the R documentation pages (color.scale if you’re interested) and created a function for you, showMe. This takes as an argument, a color vector, which as you know, is precisely what calls to p1 and p2 return to you. Call showMe now with p1(20).

showMe(p1(20))


We see the interpolated palette here. Low values in the lower left corner are red and as you move to the upper right, the colors move toward blue. Now call showMe with p2(20) as its argument.

showMe(p2(20))


Here we see a similar display, the colors moving from red to yellow, the base colors of our p2 palette. For fun, see what p2(2) looks like using showMe.

showMe(p2(2))


A much more basic pattern, simple but elegant.


We mentioned before that colorRamp (and colorRampPalette) could return a 3 or 4 long vector of colors. We saw 3-long vectors returned indicating red, green, and blue intensities. What would the 4th entry be?
Type p1 at the command prompt.

p1
function (n)
{
    x <- ramp(seq.int(0, 1, length.out = n))
    if (ncol(x) == 4L)
        rgb(x[, 1L], x[, 2L], x[, 3L], x[, 4L], maxColorValue = 255)
    else rgb(x[, 1L], x[, 2L], x[, 3L], maxColorValue = 255)
}

So the fourth argument is alpha which can be a logical, i.e., either TRUE or FALSE, or a numerical value. Create the function p3 now by calling colorRampPalette with the colors blue and green (remember to concatenate them into a single argument) and setting the alpha argument to .5.

p3 <- colorRampPalette(c("blue","green"),.5)

That's not exactly what I'm looking for. Try again. Or, type info() for more options.
Type p3 <- colorRampPalette(c("blue","green"),alpha=.5) at the command prompt.

p3 <- colorRampPalette(c("blue","green"),alpha=.5)


Now call p3 with the argument 5.

p3(5)
[1] "#0000FFFF" "#003FBFFF" "#007F7FFF" "#00BF3FFF" "#00FF00FF"


We see that in the 5-long vector that the call returned, each element has 32 bits, 4 groups of 8 bits each. The last 8 bits represent the value of alpha. Since it was NOT ZERO in the call to colorRampPalette, it gets the maximum FF value. (The same result would happen if alpha had been set to TRUE.) When it was 0 or FALSE (as in previous calls to colorRampPalette) it was given the value 00 and wasn’t shown. The leftmost 24 bits of each element are the same RGB encoding we previously saw.


So what is alpha? Alpha represents an opacity level, that is, how transparent should the colors be. We can add color transparency with the alpha parameter to calls to rgb. We haven’t seen any examples of this yet, but we will now.


We generated 1000 random normal pairs for you in the variables x and y. We’ll plot them in a scatterplot by calling plot with 4 arguments. The variables x and y are the first 2. The third is the print character argument pch. Set this equal to 19 (filled circles). The final argument is col which should be set equal to a call to rgb. Give rgb 3 arguments, 0, .5, and .5.
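A sketch of the call the lesson describes (the x and y vectors are simulated here since swirl's are not available); adding a fourth argument to rgb() sets alpha:

```r
set.seed(1)
x <- rnorm(1000)
y <- rnorm(1000)
plot(x, y, pch = 19, col = rgb(0, .5, .5))        # filled teal circles
plot(x, y, pch = 19, col = rgb(0, .5, .5, 0.3))   # same, but translucent (alpha)
```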

GGPLOT

qplot(displ, hwy, data = mpg)


A nice scatterplot done simply, right? All the labels are provided. The first argument is shown along the x-axis and the second along the y-axis. The negative trend (increasing displacement and lower gas mileage) is pretty clear. Now suppose we want to do the same plot but this time use different colors to distinguish between the 3 factors (subsets) of different types of drive (drv) in the data (front-wheel, rear-wheel, and 4-wheel). Again, qplot makes this very easy. We’ll just add what ggplot2 calls an aesthetic, a fourth argument, color, and set it equal to drv. Try this now. (Use the up arrow key to save some typing.)

qplot(displ, hwy, data = mpg, color = drv)


Pretty cool, right? See the legend to the right which qplot helpfully supplied? The colors were automatically assigned by qplot so the legend decodes the colors for you. Notice that qplot automatically used dots or points to indicate the data. These points are geoms (geometric objects). We could have used a different aesthetic, for instance shape instead of color, to distinguish between the drive types.
Now let’s add a second geom to the default points. How about some smoothing function to produce trend lines, one for each color? Just add a fifth argument, geom, and using the R function c(), set it equal to the concatenation of the two strings “point” and “smooth”. The first refers to the data points and second to the trend lines we want plotted. Try this now.

qplot(displ, hwy, data = mpg, color = drv, geom = c("point", "smooth"))
geom_smooth() using method = 'loess'


Notice the gray areas surrounding each trend line. These indicate the 95% confidence intervals for the lines.

qplot(y=hwy, data = mpg, color = drv)


What’s this plot showing? We see the x-axis ranges from 0 to 250 and we remember that we had 234 data points in our set, so we can infer that each point in the plot represents one of the hwy values (indicated by the y-axis). We’ve created the vector myhigh for you which contains the hwy data from the mpg dataset. Look at myhigh now.

The all-purpose qplot can also create box and whisker plots. Call qplot now with 4 arguments. First specify the variable by which you’ll split the data, in this case drv, then specify the variable which you want to examine, in this case hwy. The third argument is data (set equal to mpg), and the fourth, the geom, set equal to the string "boxplot".

qplot(drv, hwy, data = mpg, geom = "boxplot")


We see 3 boxes, one for each drive. Now to impress you, call qplot with 5 arguments. The first 4 are just as you used previously, (drv, hwy, data set equal to mpg, and geom set equal to the string "boxplot"). Now add a fifth argument, color, equal to manufacturer.

qplot(drv, hwy, data = mpg, geom = "boxplot", color = manufacturer)


It’s a little squished but we just wanted to illustrate qplot’s capabilities. Notice that there are still 3 regions of the plot (determined by the factor drv). Each is subdivided into several boxes depicting different manufacturers.
Now, on to histograms. These display frequency counts for a single variable. Let’s start with an easy one. Call qplot with 3 arguments. First specify the variable for which you want the frequency count, in this case hwy, then specify the data (set equal to mpg), and finally, the aesthetic, fill, set equal to drv. Instead of a plain old histogram, this will again use colors to distinguish the 3 different drive factors.

qplot(hwy, data = mpg, fill = drv)
stat_bin() using bins = 30. Pick better value with binwidth.


See how qplot consistently uses the colors. Red (if 4-wheel drv is in the bin) is at the bottom of the bin, then green on top of it (if present), followed by blue (rear wheel drv). The color lets us see right away that 4-wheel drive vehicles in this dataset don’t have gas mileages exceeding 30 miles per gallon.

It’s cool that qplot can do this so easily, but some people may find this multi-color histogram hard to interpret. Instead of using colors to distinguish between the drive factors let’s use facets or panels. (That’s what lattice called them.) This just means we’ll split the data into 3 subsets (according to drive) and make 3 smaller individual plots of each subset in one plot (and with one call to qplot).


Remember that with base plot we had to do each subplot individually. The lattice system made plotting conditioning plots easier. Let’s see how easy it is with qplot.


We’ll do two plots, a scatterplot and then a histogram, each with 3 facets. For the scatterplot, call qplot with 4 arguments. The first two are displ and hwy and the third is the argument data set equal to mpg. The fourth is the argument facets which will be set equal to the expression . ~ drv which is ggplot2’s shorthand for number of rows (to the left of the ~) and number of columns (to the right of the ~). Here the . indicates a single row and drv implies 3, since there are 3 distinct drive factors. Try this now.

qplot(displ, hwy, data = mpg, facets = . ~ drv)


The result is a 1 by 3 array of plots. Note how each is labeled at the top with the factor label (4,f, or r). This shows us more detailed information than the histogram. We see the relationship between displacement and highway mileage for each of the 3 drive factors.
Now we’ll do a histogram, again calling qplot with 4 arguments. This time, since we need only one variable for a histogram, the first is hwy and the second is the argument data set equal to mpg. The third is the argument facets which we’ll set equal to the expression drv ~ . . This will give us a different arrangement of the facets. The fourth argument is binwidth. Set this equal to 2. Try this now.

qplot(hwy, data = mpg, facets = drv ~ ., binwidth = 2)


A “grammar” of graphics means that ggplot2 contains building blocks with which you can create your own graphical objects. What are these basic components of ggplot2 plots? There are 7 of them.


Obviously, there’s a DATA FRAME which contains the data you’re trying to plot. Then the AESTHETIC MAPPINGS determine how data are mapped to color, size, etc. The GEOMS (geometric objects) are what you see in the plot (points, lines, shapes) and FACETS are the panels used in conditional plots. You’ve used these or seen them used in the first ggplot2 (qplot) lesson.


There are 3 more. STATS are statistical transformations such as binning, quantiles, and smoothing which ggplot2 applies to the data. SCALES show what coding an aesthetic map uses (for example, male = red, female = blue). Finally, the plots are depicted on a COORDINATE SYSTEM. When you use qplot these were taken care of for you.
We’ll keep using the mpg data that comes with the ggplot2 package. Recall the versatility of qplot. Just as a refresher, call qplot now with 5 arguments. The first 3 deal with data - displ, hwy, and data=mpg. The fourth is geom set equal to the concatenation of the two strings, “point” and “smooth”. The fifth is facets set equal to the formula .~drv. Try this now.

qplot(displ,hwy, data = mpg, geom = c("point","smooth"), facets = .~ drv)

qownnotes-media-m12260

We see a 3 facet plot, one for each drive type (4, f, and r). Now we’ll see how ggplot works. We’ll build up a similar plot using the basic components of the package. We’ll do this in a series of steps.
First we’ll create a variable g by assigning to it the output of a call to ggplot with 2 arguments. The first is mpg (our dataset) and the second will tell ggplot what we want to plot, in this case, displ and hwy. These are what we want our aesthetics to represent so we enclose these as two arguments to the function aes. Try this now.

g <- ggplot(mpg, aes(displ,hwy) )

Notice that nothing happened? As in the lattice system, ggplot created a graphical object which we assigned to the variable g.

Run the R command summary with g as its argument to see what g contains.

summary(g)
data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy, fl, class [234x11]
mapping: x = displ, y = hwy
faceting: (none; summary also dumps the internal functions of the empty facet object)

So g contains the mpg data with all its named components in a 234 by 11 matrix. It also contains a mapping, x (displ) and y (hwy) which you specified, and no faceting.

Note that if you tried to print g with the expressions g or print(g) you’d get an error (in recent versions of ggplot2, a blank plot instead). Even though it’s a great package, ggplot doesn’t know how to display the data yet since you didn’t specify how you wanted to see it. Now type g+geom_point() and see what happens.

g+geom_point()

By calling the function geom_point you added a layer. By not assigning the expression to a variable you displayed a plot. Notice that you didn’t have to pass any arguments to the function geom_point. That’s because the object g has all the data stored in it. (Remember you saw that when you ran summary on g before.) Now use the expression you just typed (g + geom_point()) and add to it another layer, a call to geom_smooth(). Notice the red message R gives you.

g+geom_point()+geom_smooth()
geom_smooth() using method = 'loess'

qownnotes-media-q12260

The gray shadow around the blue line is the confidence band. See how wide it is at the right? Let’s try a different smoothing function. Use the up arrow to recover the expression you just typed, and instead of calling geom_smooth with no arguments, call it with the argument method set equal to the string “lm”.

g+geom_point()+geom_smooth(method = "lm")

qownnotes-media-x12260

By changing the smoothing function to “lm” (linear model) ggplot2 generated a regression line through the data. Now recall the expression you just used and add to it another call, this time to the function facet_grid. Use the formula . ~ drv as its argument. Note that this is the same type of formula used in the calls to qplot.

g+geom_point()+geom_smooth(method = "lm")+facet_grid(. ~ drv)

qownnotes-media-Z12260

Notice how each panel is labeled with the appropriate factor. All the data associated with 4-wheel drive cars is in the leftmost panel, front-wheel drive data is shown in the middle panel, and rear-wheel drive data in the rightmost. Notice that this is similar to the plot you created at the start of the lesson using qplot. (We used a different smoothing function than previously.)

So far you’ve just used the default labels that ggplot provides. You can add your own annotation using functions such as xlab(), ylab(), and ggtitle(). In addition, the function labs() is more general and can be used to label either or both axes as well as provide a title. Now recall the expression you just typed and add a call to the function ggtitle with the argument “Swirl Rules!”.

g+geom_point()+geom_smooth(method = "lm")+facet_grid(. ~ drv)+ggtitle("Swirl Rules!")

Now that you’ve seen the basics we’ll talk about customizing. Each of the “geom” functions (e.g., _point and _smooth) has options to modify it. Also, the function theme() can be used to modify aspects of the entire plot, e.g. the position of the legend. Two standard appearance themes are included in ggplot. These are theme_gray() which is the default theme (gray background with white grid lines) and theme_bw() which is a plainer (black and white) color scheme.

Let’s practice modifying aesthetics now. We’ll use the graphic object g that we already filled with mpg data and add a call to the function geom_point, but this time we’ll give geom_point 3 arguments. Set the argument color equal to “pink”, the argument size to 4, and the argument alpha to 1/2. Notice that all the arguments are set equal to constants.

g+geom_point(color = "pink", size = 4, alpha = 1/2)

qownnotes-media-P12260

Notice the different shades of pink? That’s the result of the alpha aesthetic which you set to 1/2. This aesthetic tells ggplot how transparent the points should be. Darker circles indicate values hit by multiple data points.

Now we’ll modify the aesthetics so that color indicates which drv type each point represents. Again, use g and add to it a call to the function geom_point with 3 arguments. The first is size set equal to 4, the second is alpha equal to 1/2. The third is a call to the function aes with the argument color set equal to drv. Note that you MUST use the function aes since the color of the points is data dependent and not a constant as it was in the previous example.

g+geom_point(aes(color = drv), size = 4, alpha = 1/2)

qownnotes-media-H12260

Notice the helpful legend on the right decoding the relationship between color and drv.

Now we’ll practice modifying labels. Again, we’ll use g and add to it calls to 3 functions. First, add a call to geom_point with an argument making the color dependent on the drv type (as we did in the previous example). Second, add a call to the function labs with the argument title set equal to “Swirl Rules!”. Finally, add a call to labs with 2 arguments, one setting x equal to “Displacement” and the other setting y equal to “Hwy Mileage”.

g + geom_point(aes(color = drv)) + labs(title="Swirl Rules!") + labs(x="Displacement",y="Hwy Mileage")

qownnotes-media-L12260

Note that you could have combined the two calls to the function labs in the previous example. Now we’ll practice customizing the geom_smooth calls. Use g and add to it a call to geom_point setting the color to drv type (remember to use the call to the aes function), size set to 2 and alpha to 1/2. Then add a call to geom_smooth with 4 arguments. Set size equal to 4, linetype to 3, method to “lm”, and se to FALSE.

g + geom_point(aes(color = drv), size = 2, alpha = 1/2) + geom_smooth(size = 4, linetype = 3, method = "lm", se = FALSE)

qownnotes-media-n12260

What did these arguments do? The method specified a linear regression (note the negative slope indicating that the bigger the displacement the lower the gas mileage), the linetype specified that it should be dashed (not continuous), the size made the dashes big, and the se flag told ggplot to turn off the gray shadows indicating standard errors (confidence intervals).
Finally, let’s do a simple plot using the black and white theme, theme_bw. Specify g and add a call to the function geom_point with the argument setting the color to the drv type. Then add a call to the function theme_bw with the argument base_family set equal to “Times”. See if you notice the difference.

g + geom_point(aes(color = drv)) + theme_bw(base_family="Times")

qownnotes-media-P12260

One final note before we go through a more complicated, layered ggplot example, and this concerns the limits of the axes. We’re pointing this out to emphasize a subtle difference between ggplot and the base plotting function plot.
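The transcript cuts off before spelling the difference out, so here is a sketch of the usual illustration (the outlier data below is an assumption, not part of this transcript): in ggplot, setting limits with ylim() drops any data outside the range before plotting, while coord_cartesian() merely zooms the view, which is what base plot’s ylim does.

```r
library(ggplot2)

# Hypothetical data with one extreme outlier at position 50
testdat <- data.frame(x = 1:100, y = rnorm(100))
testdat$y[50] <- 100
g <- ggplot(testdat, aes(x = x, y = y))

# ylim() removes points outside the limits BEFORE plotting,
# so the line is broken where the outlier was dropped:
p1 <- g + geom_line() + ylim(-3, 3)

# coord_cartesian() only zooms the view, keeping all the data:
p2 <- g + geom_line() + coord_cartesian(ylim = c(-3, 3))
p1
p2
```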
We’ll close with a more complicated example to show you the full power of ggplot and the entire ggplot2 package. We’ll continue to work with the mpg dataset.

Start by creating the graphical object g by assigning to it a call to ggplot with 2 arguments. The first is the dataset and the second is a call to the function aes. This call will have 3 arguments, x set equal to displ, y set equal to hwy, and color set equal to factor(year). This last will allow us to distinguish between the two manufacturing years (1999 and 2008) in our data.

g <- ggplot(mpg,aes(x=displ,y=hwy,color=factor(year)))

We’ll build the plot up step by step. First add to g a call to the function geom_point with 0 arguments.

g+geom_point()

A simple, yet comfortingly familiar scatterplot appears. Let’s make our display a 2 dimensional multi-panel plot. Recall your last command (with the up arrow) and add to it a call to the function facet_grid. Give it 2 arguments. The first is the formula drv~cyl, and the second is the argument margins set equal to TRUE. Try this now.

g+geom_point()+facet_grid(drv~cyl,margins=TRUE)

qownnotes-media-P12260

A 4 by 5 plot, huh? The margins argument tells ggplot to display the marginal totals over each row and column, so instead of seeing 3 rows (the number of drv factors) and 4 columns (the number of cyl factors) we see a 4 by 5 display. Note that the panel in position (4,5) is a tiny version of the scatterplot of the entire dataset.

Now add to your last command (or retype it if you like to type) a call to geom_smooth with 4 arguments. These are method set to “lm”, se set to FALSE, size set to 2, and color set to “black”.

g+geom_point()+facet_grid(drv~cyl,margins=TRUE)+geom_smooth(method = "lm", se = FALSE, size=2, color="black")

qownnotes-media-g12260

Angry Birds? Finally, add to your last command (or retype it if you like to type) a call to the function labs with 3 arguments. These are x set to “Displacement”, y set to “Highway Mileage”, and title set to “Swirl Rules!”.

g+geom_point()+facet_grid(drv~cyl,margins=TRUE)+geom_smooth(method = "lm", se = FALSE, size=2, color="black")+labs(x="Displacement",y="Highway Mileage",title="Swirl Rules!")

qownnotes-media-w12260

In this lesson we’ll go through a few more qplot examples using diamond data which comes with the ggplot2 package. This data is a little more complicated than the mpg data and it contains information on various characteristics of diamonds.
Now let’s plot a histogram of the price of the 53940 diamonds in this dataset. Recall that a histogram requires only one variable of the data, so run the R command qplot with the first argument price and the argument data set equal to diamonds. This will show the frequency of different diamond prices.

qplot(price,data=diamonds)

qownnotes-media-t12260

stat_bin() using bins = 30. Pick better value with binwidth.

Not only do you get a histogram, but you also get a message about the binwidth defaulting to range/30. Recall that range refers to the spread or dispersion of the data, in this case price of diamonds. Run the R command range now with diamonds$price as its argument.
range(diamonds$price)
[1]   326 18823

We see that range returned the minimum and maximum prices, so the diamonds vary in price from $326 to $18823. We’ve done the arithmetic for you, the range (difference between these two numbers) is $18497.

Rerun qplot now with 3 arguments. The first is price, the second is data set equal to diamonds, and the third is binwidth set equal to 18497/30. (Use the up arrow to save yourself some typing.) See if the plot looks familiar.

qplot(price,data=diamonds,binwidth=18497/30)

No more messages in red, but a histogram almost identical to the previous one! If you typed 18497/30 at the command line you would get the result 616.5667. This means that the height of each bin tells you how many diamonds have a price between x and x+617 where x is the left edge of the bin.
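As a quick sanity check, the binwidth arithmetic can be reproduced at the console (326 and 18823 are the min and max that range() reported above):

```r
# Recomputing the default binwidth by hand: range/30
price_min <- 326
price_max <- 18823
binwidth <- (price_max - price_min) / 30
binwidth  # 616.5667, so each bin spans about $617
```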
We’ve created a vector containing integers that are multiples of 617 for you. It’s called brk. Look at it now.

brk
 [1]     0   617  1234  1851  2468  3085  3702  4319  4936  5553  6170  6787  7404  8021  8638
[16]  9255  9872 10489 11106 11723 12340 12957 13574 14191 14808 15425 16042 16659 17276 17893
[31] 18510 19127

We’ve also created a vector containing the number of diamonds with prices between each pair of adjacent entries of brk. For instance, the first count is the number of diamonds with prices between 0 and $617, and the second is the number of diamonds with prices between $617 and $1234. Look at the vector named counts now.

counts
 [1]  4611 13255  5230  4262  3362  2567  2831  2841  2203  1666  1445  1112   987   766   796
[16]   655   606   553   540   427   429   376   348   338   298   305   269   287   227   251
[31]    97

See how it matches the histogram you just plotted? So, qplot really works!

You’re probably sick of it but rerun qplot again, this time with 4 arguments. The first 3 are the same as the last qplot command you just ran (price, data set equal to diamonds, and binwidth set equal to 18497/30). (Use the up arrow to save yourself some typing.) The fourth argument is fill set equal to cut. The shape of the histogram will be familiar, but it will be more colorful.

qplot(price,data=diamonds,binwidth=18497/30,fill=cut)
qownnotes-media-p12260

This shows how the counts within each price grouping (bin) are distributed among the different cuts of diamonds. Notice how qplot displays these distributions relative to the cut legend on the right. The fair cut diamonds are at the bottom of each bin, the good cuts are above them, then the very good above them, until the ideal cuts are at the top of each bin. You can quickly see from this display that there are very few fair cut diamonds priced above $5000.

Now we’ll replot the histogram as a density function which will show the proportion of diamonds in each bin. This means that the shape will be similar but the scale on the y-axis will be different since, by definition, the density function is nonnegative everywhere, and the area under the curve is one. To do this, simply call qplot with 3 arguments. The first 2 are price and data (set equal to diamonds). The third is geom which should be set equal to the string “density”. Try this now.

qplot(price,data=diamonds,geom="density")

qownnotes-media-D12260

Notice that the shape is similar to that of the histogram we saw previously. The highest peak is close to 0 on the x-axis meaning that most of the diamonds in the dataset were inexpensive. In general, as prices increase (move right along the x-axis) the number of diamonds (at those prices) decrease. The exception to this is when the price is around $4000; there’s a slight increase in frequency. Let’s see if cut is responsible for this increase.
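The “area under the curve is one” property can be sketched with base R’s density() on hypothetical skewed data (an assumption, not the diamonds prices):

```r
# A kernel density estimate is nonnegative and integrates to roughly 1,
# which is why the y-axis scale differs from a count histogram.
set.seed(1)
x <- rexp(1000)                     # hypothetical right-skewed data
d <- density(x)
area <- sum(d$y) * mean(diff(d$x))  # Riemann-sum approximation of the area
area                                # close to 1
```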

Rerun qplot, this time with 4 arguments. The first 2 are the usual, and the third is geom set equal to “density”. The fourth is color set equal to cut. Try this now.

qplot(price,data=diamonds,geom="density",color=cut)

qownnotes-media-Q12260

See how easily qplot did this? Four of the five cuts have 2 peaks, one at price $1000 and the other between $4000 and $5000. The exception is the Fair cut which has a single peak at $2500. This gives us a little more understanding of the histogram we saw before.

Let’s move on to scatterplots. For these we’ll need to specify two variables from the diamond dataset.

Let’s start with carat and price. Use these as the first 2 arguments of qplot. The third should be data set equal to the dataset. Try this now.

qplot(carat,price,data=diamonds)

qownnotes-media-o12260

We see the positive trend here, as the number of carats increases the price also goes up.
Now rerun the same command, except add a fourth parameter, shape, set equal to cut.

qplot(carat,price,data=diamonds,shape=cut)

qownnotes-media-h12260

The same scatterplot appears, except the cuts of the diamonds are distinguished by different symbols. The legend at the right tells you which symbol is associated with each cut. These are small and hard to read, so rerun the same command, except this time instead of setting the argument shape equal to cut, set the argument color equal to cut.

qplot(carat,price,data=diamonds,color=cut)

qownnotes-media-H12260

We’ll rerun the plot you just did (carat,price,data=diamonds and color=cut) but add an additional parameter. Use geom_smooth with the method set equal to the string “lm”.

qplot(carat,price,data=diamonds,color=cut)+geom_smooth(method="lm")

qownnotes-media-u12260

Again, we see the same scatterplot, but slightly more compressed and showing 5 regression lines, one for each cut of diamonds. It might be hard to see, but around each line is a shadow showing the 95% confidence interval. We see, unsurprisingly, that the better the cut, the steeper (more positive) the slope of the lines.

Finally, let’s rerun that plot you just did qplot(carat,price,data=diamonds, color=cut) + geom_smooth(method="lm") but add one (just one) more argument to qplot. The new argument is facets and it should be set equal to the formula .~cut. Recall that the facets argument indicates we want a multi-panel plot. The symbol to the left of the tilde indicates rows (in this case just one) and the symbol to the right of the tilde indicates columns (in this case five, the number of cuts). Try this now.

qplot(carat,price,data=diamonds,color=cut,facets = .~cut)+geom_smooth(method="lm")

qownnotes-media-m12260

First create a graphical object g by assigning to it the output of a call to the function ggplot with 2 arguments. The first is the dataset diamonds and the second is a call to the function aes with 2 arguments, depth and price. Remember you won’t see any result.

g <- ggplot(diamonds,aes(x=depth,y=price))

Type g+geom_point(alpha=1/3) at the command prompt.

g+geom_point(alpha=1/3)

qownnotes-media-q12260

That’s somewhat interesting. We see that depth ranges from 43 to 79, but the densest distribution is around 60 to 65. Suppose we want to see if this relationship (between depth and price) is affected by cut or carat. We know cut is a factor with 5 levels (Fair, Good, Very Good, Premium, and Ideal). But carat is numeric and not a discrete factor. Can we do this?
Of course! That’s why we asked. R has a handy command, cut, which allows you to divide your data into sets and label each entry as belonging to one of the sets, in effect creating a new factor. First, we’ll have to decide where to cut the data.
Let’s divide the data into 3 pockets, so 1/3 of the data falls into each. We’ll use the R command quantile to do this. Create the variable cutpoints and assign to it the output of a call to the function quantile with 3 arguments. The first is the data to cut, namely diamonds$carat; the second is a call to the R function seq. This is also called with 3 arguments, (0, 1, and length set equal to 4). The third argument to the call to quantile is the boolean na.rm set equal to TRUE.
Type cutpoints <- quantile(diamonds$carat,seq(0,1,length=4),na.rm=TRUE) at the command prompt.

cutpoints <- quantile(diamonds$carat,seq(0,1,length=4),na.rm=TRUE)

Look at cutpoints now to understand what it is.

cutpoints
       0% 33.33333% 66.66667%      100%
     0.20      0.50      1.00      5.01

We see a 4-long vector (explaining why length was set equal to 4). We also see that .2 is the smallest carat size in the dataset and 5.01 is the largest. One third of the diamonds are between .2 and .5 carats and another third are between .5 and 1 carat in size. The remaining third are between 1 and 5.01 carats. Now we can use the R command cut to label each of the 53940 diamonds in the dataset as belonging to one of these 3 factors. Create a new name in diamonds, diamonds$car2 by assigning it the output of the call to cut. This command takes 2 arguments, diamonds$carat, which is what we want to cut, and cutpoints, the places where we’ll cut.

diamonds$car2 <- cut(diamonds$carat,cutpoints)
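The same quantile/cut recipe can be sketched on synthetic data (the uniform sizes below are an assumption, standing in for diamonds$carat):

```r
# Split a numeric vector into 3 roughly equal-sized groups,
# creating a new factor from the quantile cutpoints.
set.seed(42)
sizes <- runif(300, min = 0.2, max = 5)                        # hypothetical carat sizes
cutpts <- quantile(sizes, seq(0, 1, length = 4), na.rm = TRUE) # 4 boundaries -> 3 bins
groups <- cut(sizes, cutpts)                                   # a factor with 3 levels
table(groups)                                                  # roughly 100 per group
```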

Now we can continue with our multi-facet plot. First we have to reset g since we changed the dataset (diamonds) it contained (by adding a new column). Assign to g the output of a call to ggplot with 2 arguments. The dataset diamonds is the first, and a call to the function aes with 2 arguments (depth,price) is the second.

g <- ggplot(diamonds,aes(x=depth,y=price))

Now add to g calls to 2 functions. This first is a call to geom_point with the argument alpha set equal to 1/3. The second is a call to the function facet_grid using the formula cut ~ car2 as its argument.

g+geom_point(alpha=1/3)+facet_grid(cut ~ car2)

qownnotes-media-N12260

We see a multi-facet plot with 5 rows, each corresponding to a cut factor. Not surprising. What is surprising is the number of columns. We were expecting 3 and got 4. Why?

The first 3 columns are labeled with the cutpoint boundaries. The fourth is labeled NA and shows us where the data points with missing data (NA or Not Available) occurred. We see that there were only a handful (12 in fact) and they occurred in Very Good, Premium, and Ideal cuts. We created a vector, myd, containing the indices of these datapoints. Look at these entries in diamonds by typing the expression diamonds[myd,]. The myd tells R what rows to show and the empty column entry says to print all the columns.

diamonds[myd,]
# A tibble: 12 x 12
   carat cut       color clarity depth table price     x     y     z carat2 car2
1    0.2 Premium   E     SI2      60.2    62   345  3.79  3.75  2.27 NA     NA
2    0.2 Premium   E     VS2      59.8    62   367  3.79  3.77  2.26 NA     NA
3    0.2 Premium   E     VS2      59.0    60   367  3.81  3.78  2.24 NA     NA
4    0.2 Premium   E     VS2      61.1    59   367  3.81  3.78  2.32 NA     NA
5    0.2 Premium   E     VS2      59.7    62   367  3.84  3.80  2.28 NA     NA
6    0.2 Ideal     E     VS2      59.7    55   367  3.86  3.84  2.30 NA     NA
7    0.2 Premium   F     VS2      62.6    59   367  3.73  3.71  2.33 NA     NA
8    0.2 Ideal     D     VS2      61.5    57   367  3.81  3.77  2.33 NA     NA
9    0.2 Very Good E     VS2      63.4    59   367  3.74  3.71  2.36 NA     NA
10   0.2 Ideal     E     VS2      62.2    57   367  3.76  3.73  2.33 NA     NA
11   0.2 Premium   D     VS2      62.3    60   367  3.73  3.68  2.31 NA     NA
12   0.2 Premium   D     VS2      61.7    60   367  3.77  3.72  2.31 NA     NA

We see these entries match the plots. Whew - that’s a relief. The car2 field is, in fact, NA for these entries, but the carat field shows they each had a carat size of .2. What’s going on here?

Actually our plot answers this question. The boundaries for each column appear in the gray labels at the top of each column, and we see that the first column is labeled (0.2,0.5]. This indicates that this column contains data greater than .2 and less than or equal to .5. So diamonds with carat size .2 were excluded from the car2 field.
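This left-open interval behavior is easy to verify directly with cut():

```r
# cut() forms intervals open on the left, (a,b], so a value equal to
# the lowest break falls in no bin and becomes NA.
breaks <- c(0.2, 0.5, 1, 5.01)
cut(c(0.2, 0.3, 5.01), breaks)
# 0.2 -> NA; 0.3 -> (0.2,0.5]; 5.01 -> (1,5.01]

# include.lowest = TRUE closes the first interval and keeps the 0.2 values:
cut(c(0.2, 0.3), breaks, include.lowest = TRUE)
```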
Finally, recall the last plotting command (g+geom_point(alpha=1/3)+facet_grid(cut~car2)) or retype it if you like and add another call. This one to the function geom_smooth. Pass it 3 arguments, method set equal to the string “lm”, size set equal to 3, and color equal to the string “pink”.

g+geom_point(alpha=1/3)+facet_grid(cut ~ car2)+geom_smooth(method="lm",size=3,color="pink")

qownnotes-media-y12260

Nice thick regression lines which are somewhat interesting. You can add labels to the plot if you want but we’ll let you experiment on your own.

Lastly, ggplot2 can, of course, produce boxplots. This final exercise is the sum of 3 function calls. The first call is to ggplot with 2 arguments, diamonds and a call to aes with carat and price as arguments. The second call is to geom_boxplot with no arguments. The third is to facet_grid with one argument, the formula . ~ cut. Try this now.

ggplot(diamonds,aes(x=carat,y=price))+geom_boxplot()+facet_grid(. ~ cut)

qownnotes-media-W12260

Yes! A boxplot looking like marshmallows about to be roasted. Well done and congratulations! You’ve finished this jewel of a lesson. Hope it paid off!

Hierarchical clustering

This is a technique that lets us group the data in a hierarchical way, and it is useful for understanding the nature of the data we are dealing with.

The idea is to keep finding points that group together according to the distance between the data, so that each level of the hierarchy contains fewer points than the previous one. The chart that shows this hierarchy is usually called a dendrogram.

To build this hierarchical clustering:

  • Find the two closest points
  • Merge them
  • Find the next closest point

This requires:

  1. Defining a criterion for measuring distance
  2. Defining a criterion for adding a point to a group or not

There are different methods for measuring distance:

  • Euclidean distance (continuous)
  • Correlation similarity (continuous)
  • Manhattan distance (binary)

It is important to choose the model that best fits our data, to avoid garbage in -> garbage out.

qownnotes-media-T16180

Manhattan distance assumes you cannot draw a straight line between the two points (as if you were in Manhattan): you add up all the segments of each route to see which total is smallest; in the example, the red, yellow, and blue routes. The green line would represent the Euclidean distance, which in this case is impossible to follow because we cannot cut through the blocks.
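The two criteria can be compared directly with base R's dist() on a single pair of points:

```r
# Euclidean vs Manhattan distance between (0,0) and (3,4)
pts <- rbind(c(0, 0), c(3, 4))
dist(pts)                        # Euclidean: sqrt(3^2 + 4^2) = 5
dist(pts, method = "manhattan")  # Manhattan: |3| + |4|       = 7
```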

qownnotes-media-C16180

Set the seed so that the analysis is reproducible and we obtain the same results.

The following shows an example of carrying out hierarchical clustering on a set of normally distributed random points.

qownnotes-media-n16180

With the dist function we obtain a matrix of distances between all the points, by default using Euclidean distance as the criterion.
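A minimal sketch of this workflow in base R, along the lines of the course's random-points example (the seed and cluster centers here are assumptions):

```r
# 12 normally distributed points around 3 centers,
# the distance matrix, and the dendrogram.
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
df <- data.frame(x = x, y = y)

d <- dist(df)    # pairwise Euclidean distances (the default criterion)
hc <- hclust(d)  # hierarchical clustering (complete linkage by default)
plot(hc)         # draws the dendrogram
```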

qownnotes-media-h16180

To start the calculation, we repeatedly go through the pairs of closest points, merge them, and replace them with a new point that falls at their geometric center; in this case 5 and 6 are merged first. How the distance between the resulting clusters is then measured depends on the linkage: complete linkage takes the two farthest points between the clusters, while average linkage takes the distance between their centers of gravity. Neither way is better or worse; the recommendation is to try both and see which fits what we are looking for.

qownnotes-media-C16180

That new point enters the next iteration, in which we look for the next two closest points, in this example 10 and 11; we create a new point between them, and so on, gradually building what is called a dendrogram.

qownnotes-media-B16180

At this point we have to choose at which level we want to make the cut to obtain the clusters; at level 2, for example, there are 2 clusters, and at level 1 there are 3.
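In R that cut is made with cutree, either at a height (h) or by asking directly for a number of clusters (k); a sketch on made-up points:

```r
set.seed(42)
hc <- hclust(dist(cbind(rnorm(12), rnorm(12))))
plot(hc)
abline(h = 1.5, col = "blue")  # visualize the cut on the dendrogram
cutree(hc, h = 1.5)            # cluster labels when cutting at height 1.5
cutree(hc, k = 3)              # or ask directly for 3 clusters
```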

Besides the default R implementation there are more advanced libraries that color the results for you.

qownnotes-media-Z16180

You have to tell them at which level you want to cut the clusters.

qownnotes-media-K16180

Another way to display this hierarchical clustering is through a heat map.

A heat map uses these same methods to show the relationship between one or more columns. Here we see that initially there appear to be 3 clusters.
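A heat map of this kind is one call in base R; heatmap clusters rows and columns and draws both dendrograms in the margins (the matrix here is made up):

```r
set.seed(143)
dataMatrix <- matrix(rnorm(120), nrow = 12)
heatmap(dataMatrix, col = cm.colors(25))  # palette argument is optional
```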

qownnotes-media-u16180

Once this study is done, we may have to redo it to get a better view of what is closest to reality. We can improve it with the following considerations:

  • Including or excluding outliers that may be negatively influencing the result
  • Handling missing values differently
  • Choosing a different distance
  • Choosing a different grouping (linkage) criterion
  • Changing the scale or units of one of the variables

Kmeans clustering and dimension reduction

K-means is a technique for dividing the observations into a given number of clusters. Its approach is the following:

  • Fix a number of clusters up front
  • Take the centroids of each cluster
  • Assign points to the nearest centroid
  • Recalculate the centroids

To carry out this study, you must define the distance criterion, the number of clusters, and an approximate initial choice of centroids.
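All of this is wrapped in the base R function kmeans; a minimal sketch on made-up points:

```r
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
kmeansObj <- kmeans(data.frame(x, y), centers = 3)
kmeansObj$cluster  # cluster assigned to each point
kmeansObj$centers  # final centroid positions
plot(x, y, col = kmeansObj$cluster, pch = 19, cex = 2)
points(kmeansObj$centers, col = 1:3, pch = 3, cex = 3, lwd = 3)
```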

K-means starts by choosing random centroids.

qownnotes-media-V16180

It then iterates, recalculating the centroids, for example as the mean of the points assigned to each centroid.

qownnotes-media-m16180

qownnotes-media-z16180

If we want to plot the results, we can use the "centers" component and draw it on top of the plot.

qownnotes-media-m16180

There is another way to visualize the information k-means gives us: heat maps.

In this image we show the data as they are on the left, and on the right ordered by each of the detected clusters. It can be quite useful for detecting patterns in the data.

qownnotes-media-Z16180

Dimension reduction

We take a matrix of normally distributed random data and see, via the image function, that the resulting plot is very noisy and no visual patterns are easily seen. The matrix has 40 rows and 10 columns, so we can think of it as 10 variables and 40 rows of data.

set.seed(12345); par(mar=rep(0.2,4))
dataMatrix <- matrix(rnorm(400),nrow=40)
image(1:10,1:40,t(dataMatrix)[,nrow(dataMatrix):1])

We can look at the dendrogram of both its rows and its columns, which doesn't seem to tell us much either.

par(mar=rep(0.2,4))
heatmap(dataMatrix)

Let's add a pattern to the data, for example adding a fixed amount to the columns (0 to the first 5 and 3 to the next 5), on some rows but not others, chosen at random.

set.seed(678910)
for(i in 1:40){
  # flip a coin
  coinFlip <- rbinom(1,size=1,prob=0.5)
  # if coin is heads add a common pattern to that row
  if(coinFlip){
    dataMatrix[i,] <- dataMatrix[i,] + rep(c(0,3),each=5)
  }
}
par(mar=rep(0.2,4))
image(1:10,1:40,t(dataMatrix)[,nrow(dataMatrix):1])

There don't seem to be patterns in the rows, but in the columns we do see two clear groups: those where we added 0 and those where we added 3. So the next step is a deeper analysis: draw a dendrogram to see whether we can identify this pattern in the columns.

par(mar=rep(0.2,4))
heatmap(dataMatrix)

Besides the heat map, where once the columns are ordered we can see a pattern both in the rows (a change is visible roughly halfway down) and in the columns, something else that can help at this point is to take the mean of each of the 40 rows and the mean of each of the 10 columns and see whether they follow a pattern. In the figure below this pattern is clearly visible in both plots.

hh <- hclust(dist(dataMatrix)); dataMatrixOrdered <- dataMatrix[hh$order,]
par(mfrow=c(1,3))
image(t(dataMatrixOrdered)[,nrow(dataMatrixOrdered):1])
plot(rowMeans(dataMatrixOrdered),40:1,xlab="Row Mean",ylab="Row",pch=19)
plot(colMeans(dataMatrixOrdered),xlab="Column",ylab="Column Mean",pch=19)

Having found these patterns in the data will be very useful later when building our model, since it can let us reduce the number of variables to those that explain the most variance. This is what is called dimension reduction.

  • Find a set of variables that are uncorrelated and explain as much variance in the data as possible. For example, weight and height are two highly correlated variables that would not add variability, so we could consider dropping one of them.

  • If we want to put all the variables together in a matrix, find the best matrix created with fewer variables that explains the original data.

The first goal is statistical and the second is data compression.

PRINCIPAL COMPONENTS ANALYSIS Y SINGULAR VALUE DECOMPOSITION (PCA/SVD)

To carry out this data compression and choose only the variables that explain most of the variance, there is the SVD technique, in which the matrix of rows and columns is decomposed into the product of three matrices, U, D, and V (X = U D Vᵀ):

* U --> its columns are orthogonal (independent of each other); called the left singular vectors (LSV)
* V --> its columns are also orthogonal; called the right singular vectors (RSV)
* D --> a diagonal matrix containing the singular values, which explain the variance
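The decomposition can be checked numerically: svd returns u, d, and v, and multiplying them back recovers the original matrix (the data here are made up):

```r
set.seed(1)
X <- matrix(rnorm(40), nrow = 8)
s <- svd(X)                          # X = U D V^t
Xhat <- s$u %*% diag(s$d) %*% t(s$v)
max(abs(X - Xhat))                   # essentially 0, up to rounding error
```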

Besides this, another commonly used technique is PCA, which applies SVD after first scaling the data (it takes each column to be compared and transforms it by subtracting the mean of the column and dividing by the standard deviation of the column).

The principal components from PCA are equal to the right singular vectors (RSV) from the SVD of the scaled data.

We can look at this graphically again: the column means clearly fall into a pattern, and likewise the row means change pattern from row 18 onward. In both cases we clearly see the pattern we introduced into the data earlier, and this is also reflected in the U and V matrices.

In the U matrix, from row 18 onward the values flip from negative to positive, capturing the change in the row means, and we see the same change reflected from column 5 onward in the V matrix. In this case the data in both the observations (rows) and the variables (columns) are very homogeneous, so the means alone already show the pattern; there will be cases, however, where the data are not so well behaved and the U and V matrices really do help us spot these patterns.

svd1 <- svd(scale(dataMatrixOrdered))
par(mfrow=c(1,3))
image(t(dataMatrixOrdered)[,nrow(dataMatrixOrdered):1])
plot(svd1$u[,1],40:1,xlab="Row",ylab="First left singular vector",pch=19)
plot(svd1$v[,1],xlab="Column",ylab="First right singular vector",pch=19)

We can also look at the D component, the diagonal matrix. The singular values summarize how much variance each of the components contributes to our data. The first component shows the most variance, and so on.

The second plot shows the same thing as proportions: the first point captures 40% of the variance in our data. Almost half of the variance of all the data is contained in this singular value, which can be considered one dimension.

par(mfrow=c(1,2))
plot(svd1$d,xlab="Column",ylab="Singular value",pch=19)
plot(svd1$d^2/sum(svd1$d^2),xlab="Column",ylab="Prop. of variance explained",pch=19)

To see the relation between the principal components of PCA and the right singular vectors of SVD, the following plot shows that they are exactly the same values, as mentioned above.

svd1 <- svd(scale(dataMatrixOrdered))
pca1 <- prcomp(dataMatrixOrdered,scale=TRUE)
plot(pca1$rotation[,1],svd1$v[,1],pch=19,xlab="Principal Component 1",ylab="Right Singular Vector 1")
abline(c(0,1))

To see this more clearly, we create a matrix of 0s and 1s with a single pattern: the first 5 columns are 0s and the next 5 are 1s. It is clear that a single dimension explains 100% of the variance; that is, with just one component we could capture 100% of the variability of our data. This shows up clearly in the vector of singular values.

constantMatrix <- dataMatrixOrdered*0
for(i in 1:dim(dataMatrixOrdered)[1]){constantMatrix[i,] <- rep(c(0,1),each=5)}
svd1 <- svd(constantMatrix)
par(mfrow=c(1,3))
image(t(constantMatrix)[,nrow(constantMatrix):1])
plot(svd1$d,xlab="Column",ylab="Singular value",pch=19)
plot(svd1$d^2/sum(svd1$d^2),xlab="Column",ylab="Prop. of variance explained",pch=19)

Let's add two patterns to these data: one that shifts across the columns and another that alternates. It can be seen clearly in the plots: the first pattern is 0 for columns 1 to 5 and 1 for columns 6 to 10, while the second alternates 0 and 1 across the columns.

set.seed(678910)
for(i in 1:40){
  # flip a coin
  coinFlip1 <- rbinom(1,size=1,prob=0.5)
  coinFlip2 <- rbinom(1,size=1,prob=0.5)
  # if coin is heads add a common pattern to that row
  if(coinFlip1){
    dataMatrix[i,] <- dataMatrix[i,] + rep(c(0,5),each=5)
  }
  if(coinFlip2){
    dataMatrix[i,] <- dataMatrix[i,] + rep(c(0,5),5)
  }
}
hh <- hclust(dist(dataMatrix)); dataMatrixOrdered <- dataMatrix[hh$order,]
svd2 <- svd(scale(dataMatrixOrdered))
par(mfrow=c(1,4))
image(t(dataMatrix)[,nrow(dataMatrix):1])
image(t(dataMatrixOrdered)[,nrow(dataMatrixOrdered):1])
plot(rep(c(0,1),each=5),pch=19,xlab="Column",ylab="Pattern 1")
plot(rep(c(0,1),5),pch=19,xlab="Column",ylab="Pattern 2")

Let's see whether we can recover these patterns with an SVD analysis. In the first right singular vector (V) the values for columns 1-5 are low and for 6-10 higher, so we have identified the first column pattern. To look for a second pattern we turn to the second right singular vector (V), where the alternation in the means is not as easy to see, but each point is alternately lower or higher than the next.

svd2 <- svd(scale(dataMatrixOrdered))
par(mfrow=c(1,3))
image(t(dataMatrixOrdered)[,nrow(dataMatrixOrdered):1])
plot(svd2$v[,1],pch=19,xlab="Column",ylab="First right singular vector")
plot(svd2$v[,2],pch=19,xlab="Column",ylab="Second right singular vector")

Finally, we can turn to the vector of singular values to see how much weight each of these detected patterns carries. The first pattern, the shift, is very drastic: its point takes about 0.5 of the weight. The second pattern, being a lighter alternating pattern, takes only about 0.18.

svd1 <- svd(scale(dataMatrixOrdered))
par(mfrow=c(1,2))
plot(svd1$d,xlab="Column",ylab="Singular value",pch=19)
plot(svd1$d^2/sum(svd1$d^2),xlab="Column",ylab="Percent of variance explained",pch=19)

A problem with the SVD technique, and therefore with PCA, is that if there are missing values the computation will fail, logically, since these are matrix operations. So we will need a strategy to assign a value to the missing values.

Imputing values for missing data

As we see here, SVD fails once we introduce NA values.

dataMatrix2 <- dataMatrixOrdered
## Randomly insert some missing data
dataMatrix2[sample(1:100,size=40,replace=FALSE)] <- NA
svd1 <- svd(scale(dataMatrix2))  ## Doesn't work!
## Error in svd(scale(dataMatrix2)): infinite or missing values in 'x'

An alternative is to use the impute function; comparing its results (after inserting some NAs at random) with the original, we see there is not much variation.
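The impute function referred to here is impute.knn from the Bioconductor impute package, which fills each NA using the k nearest rows. A sketch, assuming that package is installed and dataMatrixOrdered exists from the earlier code:

```r
library(impute)  # Bioconductor package: BiocManager::install("impute")
dataMatrix2 <- dataMatrixOrdered
dataMatrix2[sample(1:100, size = 40, replace = FALSE)] <- NA
dataMatrix2 <- impute.knn(dataMatrix2)$data  # NAs replaced by neighbor-based estimates
svd1 <- svd(scale(dataMatrix2))              # now runs without error
```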

qownnotes-media-j13840

Let's look at one last, very illustrative example of what choosing well or badly which columns to drop from our data would mean for faithfully representing the initial dataset.

We start from a dataset that represents a face.

load("data/face.rda")
image(t(faceData)[,nrow(faceData):1])

Looking at its vector of singular values, we see that the first 10 components capture close to 100% of the variability of the data.

svd1 <- svd(scale(faceData))
plot(svd1$d^2/sum(svd1$d^2),pch=19,xlab="Singular vector",ylab="Variance explained")

We will create 3 approximations: the first keeps only the first component, the second the first 5 components, and the last the first 10 components.

It can be seen very well graphically that taking the first 10 components gives a very good visual approximation of the original face, shown in the last plot.

Besides its value for this study, this example reflects one of the image-compression techniques that can be used to reduce the size of a photo without losing too much quality.

svd1 <- svd(scale(faceData))
## Note that %*% is matrix multiplication

# Here svd1$d[1] is a constant
approx1 <- svd1$u[,1] %*% t(svd1$v[,1]) * svd1$d[1]

# In these examples we need to make the diagonal matrix out of d
approx5 <- svd1$u[,1:5] %*% diag(svd1$d[1:5])%*% t(svd1$v[,1:5])
approx10 <- svd1$u[,1:10] %*% diag(svd1$d[1:10])%*% t(svd1$v[,1:10]) 
par(mfrow=c(1,4))
image(t(approx1)[,nrow(approx1):1], main = "(a)")
image(t(approx5)[,nrow(approx5):1], main = "(b)")
image(t(approx10)[,nrow(approx10):1], main = "(c)")
image(t(faceData)[,nrow(faceData):1], main = "(d)")  ## Original data

Colors

The choice of color can be a very important factor in helping to find patterns.

qownnotes-media-s13840

All the color codes:

http://cloford.com/resources/colours/500col.htm

colors() gives us the most common colors in R. The grDevices package helps with this.
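Two grDevices helpers worth knowing are colorRamp, which returns a function mapping [0,1] to RGB components, and colorRampPalette, which returns a function producing n interpolated colors:

```r
pal <- colorRamp(c("red", "blue"))
pal(0)    # 255 0 0: pure red
pal(0.5)  # halfway between red and blue
pal2 <- colorRampPalette(c("red", "yellow"))
pal2(5)   # five hex color codes from red to yellow
```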

qownnotes-media-Y13840

qownnotes-media-n13840

qownnotes-media-v13840

qownnotes-media-y13840

This function is very similar, except that it represents each of the 3 colors in hexadecimal: the first two characters for red, the next two for green, and the last two for blue (red -> #FF0000).
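These hexadecimal codes can be built with the rgb function, which takes the red, green, and blue components (and an optional alpha for transparency):

```r
rgb(1, 0, 0)               # "#FF0000", pure red
rgb(0, 0, 1, alpha = 0.5)  # "#0000FF80", half-transparent blue
```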

qownnotes-media-N13840

qownnotes-media-q13840

qownnotes-media-h13840

qownnotes-media-j13840

qownnotes-media-v13840

qownnotes-media-D13840

qownnotes-media-H13840

qownnotes-media-h13840

qownnotes-media-g13840

EXERCISES

To give you an idea of what we’re talking about, consider these random points we generated. We’ll use them to demonstrate hierarchical clustering in this lesson. We’ll do this in several steps, but first we have to clarify our terms and concepts.
qownnotes-media-dQ4992

It’s pretty obvious that out of the 4 choices, the pair 5 and 6 were the closest together. However, there are several ways to measure distance or similarity. Euclidean distance and correlation similarity are continuous measures, while Manhattan distance is a binary measure. In this lesson we’ll just briefly discuss the first and last of these. It’s important that you use a measure of distance that fits your problem.
Euclidean distance is what you learned about in high school algebra. Given two points on a plane, (x1,y1) and (x2,y2), the Euclidean distance is the square root of the sums of the squares of the distances between the two x-coordinates (x1-x2) and the two y-coordinates (y1-y2). You probably recognize this as an application of the Pythagorean theorem which yields the length of the hypotenuse of a right triangle.


It shouldn’t be hard to believe that this generalizes to more than two dimensions as shown in the formula at the bottom of the picture shown here.
qownnotes-media-cM4992

In this case, we can use Manhattan or city block distance (also known as a taxicab metric). This picture, copied from http://en.wikipedia.org/wiki/Taxicab_geometry, shows what this means.


You want to travel from the point at the lower left to the one on the top right. The shortest distance is the Euclidean (the green line), but you’re limited to the grid, so you have to follow a path similar to those shown in red, blue, or yellow. These all have the same length (12) which is the number of small gray segments covered by their paths.
qownnotes-media-WC4992

Run the R command dist with the argument dataFrame to compute the distances between all pairs of these points. By default dist uses Euclidean distance as its metric, but other metrics such as Manhattan, are available. Just use the default.
qownnotes-media-NQ4992

dist(dataFrame)

qownnotes-media-nN4992

So 0.0815 (units are unspecified) between points 5 and 6 is the shortest distance. We can put these points in a single cluster and look for another close pair of points.
qownnotes-media-Ws4992

So 10 and 11 are another pair of points that would be in a second cluster. We’ll start creating our dendrogram now. Here’re the original plot and two beginning pieces of the dendrogram.
qownnotes-media-qQ4992

The clustering itself is done by the R function hclust(), which takes as an argument the pairwise distance matrix we looked at before. We've stored this matrix for you in a variable called distxy. Run hclust now with distxy as its argument and put the result in the variable hc.

hc <- hclust(distxy)

Call the R function plot with one argument, hc.

plot(hc)

qownnotes-media-cs4992

Nice plot, right? R’s plot conveniently labeled everything for you. The points we saw are the leaves at the bottom of the graph, 5 and 6 are connected, as are 10 and 11. Moreover, we see that the original 3 groupings of points are closest together as leaves on the picture. That’s reassuring. Now call plot again, this time with the argument as.dendrogram(hc).

class(hc)
[1] "hclust"

plot(as.dendrogram(hc))

qownnotes-media-NR4992

The essentials are the same, but the labels are missing and the leaves (original points) are all printed at the same level. Notice that the vertical heights of the lines and the labeling of the scale on the left edge give some indication of distance. Use the R command abline to draw a horizontal blue line at 1.5 on this plot. Recall that this requires 2 arguments, h=1.5 and col="blue".
We see that this blue line intersects 3 vertical lines and this tells us that using the distance 1.5 (unspecified units) gives us 3 clusters (1 through 4), (9 through 12), and (5 through 8). We call this a "cut" of our dendrogram. Now cut the dendrogram by drawing a red horizontal line at .4.
qownnotes-media-IB4992

So the number of clusters in your data depends on where you draw the line! (We said there’s a lot of flexibility here.) Now that we’ve seen the practice, let’s go back to some “theory”. Notice that the two original groupings, 5 through 8, and 9 through 12, are connected with a horizontal line near the top of the display. You’re probably wondering how distances between clusters of points are measured.


There are several ways to do this. We’ll just mention two. The first is called complete linkage and it says that if you’re trying to measure a distance between two clusters, take the greatest distance between the pairs of points in those two clusters. Obviously such pairs contain one point from each cluster.

COMPLETE LINKAGE


So if we were measuring the distance between the two clusters of points (1 through 4) and (5 through 8), using complete linkage as the metric we would use the distance between points 4 and 8 as the measure since this is the largest distance between the pairs of those groups.
qownnotes-media-Al4992

For 3 clusters it would look like this:

qownnotes-media-sU4992

We’ve created the dataframe dFsm for you containing these 3 points, 4, 8, and 11. Run dist on dFsm to see what the smallest distance between these 3 points is.

dist(dFsm)

qownnotes-media-mV4992


We see that the smallest distance is between points 2 and 3 in this reduced set, (these are actually points 8 and 11 in the original set), indicating that the two clusters these points represent ((5 through 8) and (9 through 12) respectively) would be joined (at a distance of 1.869) before being connected with the third cluster (1 through 4). This is consistent with the dendrogram we plotted.

AVERAGE LINKAGE

The second way to measure a distance between two clusters that we’ll just mention is called average linkage. First you compute an “average” point in each cluster (think of it as the cluster’s center of gravity). You do this by computing the mean (average) x and y coordinates of the points in the cluster.


Then you compute the distances between each cluster average to compute the intercluster distance.
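As a sketch of that computation (the cluster labels in clust are made up): the centers of gravity are the per-cluster coordinate means, and dist between those centers gives the intercluster distances:

```r
set.seed(7)
x <- rnorm(12); y <- rnorm(12)
clust <- rep(1:3, each = 4)               # made-up cluster assignments
centers <- cbind(tapply(x, clust, mean),  # mean x per cluster
                 tapply(y, clust, mean))  # mean y per cluster
dist(centers)                             # average-linkage intercluster distances
```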


Now look at the hierarchical cluster we created before, hc.

hc

Call:
hclust(d = distxy)

Cluster method   : complete
Distance         : euclidean
Number of objects: 12

In our simple set of data, the average and complete linkages aren’t that different, but in more complicated datasets the type of linkage you use could affect how your data clusters. It is a good idea to experiment with different methods of linkage to see the varying ways your data groups. This will help you determine the best way to continue with your analysis.

HEATMAPS

R provides a handy function to produce heat maps. It’s called heatmap. We’ve put the point data we’ve been using throughout this lesson in a matrix. Call heatmap now with 2 arguments. The first is dataMatrix and the second is col set equal to cm.colors(25). This last is optional, but we like the colors better than the default ones.
qownnotes-media-rO4992

heatmap(dataMatrix,col=cm.colors(25))

qownnotes-media-FC4992

We see an interesting display of sorts. This is a very simple heat map - simple because the data isn’t very complex. The rows and columns are grouped together as shown by colors. The top rows (labeled 5, 6, and 7) seem to be in the same group (same colors) while 8 is next to them but colored differently. This matches the dendrogram shown on the left edge. Similarly, 9, 12, 11, and 10 are grouped together (row-wise) along with 3 and 2. These are followed by 1 and 4 which are in a separate group. Column data is treated independently of rows but is also grouped.
qownnotes-media-Ap4992

qownnotes-media-WG4992

This looks slightly more interesting than the heatmap for the point data. It shows a little better how the rows and columns are treated (clustered and colored) independently of one another. To understand the disparity in color (between the left 4 columns and the right 2) look at mt now.
See how four of the columns are all relatively small numbers and only two (disp and hp) are large? That explains the big difference in color columns. Now to understand the grouping of the rows call plot with one argument, the dendrogram object denmt we’ve created for you.

plot(denmt)

qownnotes-media-YW4992

We see that this dendrogram is the one displayed at the side of the heat map. How was this created? Recall that we generalized the distance formula for more than 2 dimensions. We’ve created a distance matrix for you, distmt. Look at it now.

dist(mt)

qownnotes-media-fj4992

See how these distances match those in the dendrogram? So hclust really works! Let’s review now.

KMEANS

In this lesson we'll learn about k-means clustering, another simple way of examining and organizing multi-dimensional data. As with hierarchical clustering, this technique is most useful in the early stages of analysis when you're trying to get an understanding of the data, e.g., finding some pattern or relationship between different factors or variables.

Since clustering organizes data points that are close into groups we'll assume we've decided on a measure of distance, e.g., Euclidean.

As we said, k-means is a partitioning approach which requires that you first guess how many clusters you have (or want). Once you fix this number, you randomly create a "centroid" (a phantom point) for each cluster and assign each point or observation in your dataset to the centroid to which it is closest. Once each point is assigned a centroid, you readjust the centroid's position by making it the average of the points assigned to it.

When it’s finished k-means clustering returns a final position of each cluster’s centroid as well as the assignment of each data point or observation to a cluster.
Now we’ll step through this process using our random points as our data. The coordinates of these are stored in 2 vectors, x and y. We eyeball the display and guess that there are 3 clusters. We’ll pick 3 positions of centroids, one for each cluster.
We’ve created two 3-long vectors for you, cx and cy. These respectively hold the x- and y- coordinates for 3 proposed centroids. For convenience, we’ve also stored them in a 2 by 3 matrix cmat. The x coordinates are in the first row and the y coordinates in the second. Look at cmat now.

cmat
     [,1] [,2] [,3]
[1,]  1.0  1.8  2.5
[2,]  2.0  1.0  1.5


The coordinates of these points are (1,2), (1.8,1) and (2.5,1.5). We’ll add these centroids to the plot of our points. Do this by calling the R command points with 6 arguments. The first 2 are cx and cy, and the third is col set equal to the concatenation of 3 colors, “red”, “orange”, and “purple”. The fourth argument is pch set equal to 3 (a plus sign), the fifth is cex set equal to 2 (expansion of character), and the final is lwd (line width) also set equal to 2.

points(cx,cy,col=c("red","orange","purple"),pch=3,cex=2,lwd=2)

qownnotes-media-p14404

Now we have to calculate distances between each point and every centroid. There are 12 data points and 3 centroids. How many distances do we have to calculate?

1: 15
2: 36
3: 108
4: 9

Selection: 2

We’ve written a function for you called mdist which takes 4 arguments. The vectors of data points (x and y) are the first two and the two vectors of centroid coordinates (cx and cy) are the last two. Call mdist now with these arguments.
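mdist is a helper provided by the swirl lesson, not a base R function. A plausible reimplementation (an assumption, not the course's exact code) returns a matrix with one row per centroid and one column per data point:

```r
# Hypothetical sketch of mdist: rows = centroids, columns = data points
mdist <- function(x, y, cx, cy) {
  d <- matrix(0, nrow = length(cx), ncol = length(x))
  for (i in seq_along(cx)) {
    d[i, ] <- sqrt((x - cx[i])^2 + (y - cy[i])^2)  # Euclidean distances
  }
  d
}
```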
qownnotes-media-z14404

We've stored these distances in the matrix distTmp for you. Now we have to assign a cluster to each point. To do that we'll look at each column and pick the minimum entry.
R has a handy function which.min which you can apply to ALL the columns of distTmp with one call. Simply call the R function apply with 3 arguments. The first is distTmp, the second is 2 meaning the columns of distTmp, and the third is which.min, the function you want to apply to the columns of distTmp. Try this now.

apply(distTmp,2,which.min)
 [1] 2 2 2 1 3 3 3 1 3 3 3 3

We’ve stored the vector of cluster colors (“red”,“orange”,“purple”) in the array cols1 for you and we’ve also stored the cluster assignments in the array newClust. Let’s color the 12 data points according to their assignments. Again, use the command points with 5 arguments. The first 2 are x and y. The third is pch set to 19, the fourth is cex set to 2, and the last, col is set to cols1[newClust].

points(x,y,pch=19,cex=2,col=cols1[newClust])

qownnotes-media-p14404

We can use the R function tapply which applies “a function over a ragged array”. This means that every element of the array is assigned a factor and the function is applied to subsets of the array (identified by the factor vector). This allows us to take advantage of the factor vector newClust we calculated. Call tapply now with 3 arguments, x (the data), newClust (the factor array), and mean (the function to apply).

tapply(x,newClust,mean)
       1        2        3 
1.210767 1.010320 2.498011


Repeat the call, except now apply it to the vector y instead of x.

tapply(y,newClust,mean)
       1        2        3 
1.730555 1.016513 1.354373


Now that we have new x and new y coordinates for the 3 centroids we can plot them. We’ve stored off the coordinates for you in variables newCx and newCy. Use the R command points with these as the first 2 arguments. In addition, use the arguments col set equal to cols1, pch equal to 8, cex equal to 2 and lwd also equal to 2.

points(newCx,newCy,col=cols1,pch=8,cex=2,lwd=2)

qownnotes-media-n14404

We see how the centroids have moved closer to their respective clusters. This is especially true of the second (orange) cluster. Now call the distance function mdist with the 4 arguments x, y, newCx, and newCy. This will allow us to reassign the data points to new clusters if necessary.
We've stored off this new matrix of distances in the matrix distTmp2 for you. Recall that the first cluster is red, the second orange and the third purple. Look closely at columns 4 and 7 of distTmp2. What will happen to points 4 and 7?

1: They will both change clusters
2: Nothing
3: They will both change to cluster 2
4: They're the only points that won't change clusters

Selection: 1

Now call apply with 3 arguments, distTmp2, 2, and which.min to find the new cluster assignments for the points.

apply(distTmp2,2,which.min)
 [1] 2 2 2 2 3 3 1 1 3 3 3 3


We’ve stored off the new cluster assignments in a vector of factors called newClust2. Use the R function points to recolor the points with their new assignments. Again, there are 5 arguments, x and y are first, followed by pch set to 19, cex to 2, and col to cols1[newClust2].

points(x,y,pch=19,cex=2,col=cols1[newClust2])

You’re the best!

|==================================================== | 58%

Notice that points 4 and 7 both changed clusters; point 4 moved from 1 to 2 (red to orange), and point 7 switched from 3 to 1 (purple to red).
qownnotes-media-V14404


Now use tapply to find the x coordinate of the new centroid. Recall there are 3 arguments, x, newClust2, and mean.

tapply(x,newClust2,mean)
        1         2         3 
1.8878628 0.8904553 2.6001704

Do the same to find the new y coordinate.

tapply(y,newClust2,mean)
       1        2        3 
2.157866 1.006871 1.274675

Perseverance, that’s the answer.

|========================================================= | 64%

We’ve stored off these coordinates for you in the variables finalCx and finalCy. Plot these new centroids using the points function with 6 arguments. The first 2 are finalCx and finalCy. The argument col should equal cols1, pch should equal 9, cex 2 and lwd 2.

points(finalCx,finalCy,col=cols1,pch=9,cex=2,lwd=2)

You got it!

|=========================================================== | 66%

It should be obvious that if we continued this process points 5 through 8 would all turn red, while points 1 through 4 stay orange, and points 9 through 12 purple.
qownnotes-media-m14404


Now that you’ve gone through an example step by step, you’ll be relieved to hear that R provides a command to do all this work for you. Unsurprisingly it’s called kmeans and, although it has several parameters, we’ll just mention four. These are x, (the numeric matrix of data), centers, iter.max, and nstart. The second of these (centers) can be either a number of clusters or a set of initial centroids. The third, iter.max, specifies the maximum number of iterations to go through, and nstart is the number of random starts you want to try if you specify centers as a number.
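A hedged sketch of such a call. The lesson's dataFrame isn't reproduced here, so we generate 12 similar points as a stand-in:

```r
set.seed(1234)
# 12 points loosely grouped around 3 centers, mimicking the lesson's data
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
dataFrame <- data.frame(x, y)
# ask for 3 clusters, capping iterations and using a single random start
km <- kmeans(dataFrame, centers = 3, iter.max = 10, nstart = 1)
km$cluster   # which of the 3 clusters each of the 12 points landed in
km$centers   # a 3 x 2 matrix of centroid coordinates
```

Passing centers a matrix of starting centroids instead of the number 3 would reproduce the manual walk-through above exactly.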
qownnotes-media-s14404


The program returns the information that the data clustered into 3 clusters each of size 4. It also returns the coordinates of the 3 cluster means, a vector named cluster indicating how the 12 points were partitioned into the clusters, and the sum of squares within each cluster. It also shows all the available components returned by the function. We’ve stored off this data for you in a kmeans object called kmObj. Look at kmObj$iter to see how many iterations the algorithm went through.

kmObj$iter
[1] 2

All that hard work is paying off!

|================================================================== | 74%

Two iterations as we did before. We just want to emphasize how you can access the information available to you. Let’s plot the data points color coded according to their cluster. This was stored in kmObj$cluster. Run plot with 5 arguments. The data, x and y, are the first two; the third, col is set equal to kmObj$cluster, and the last two are pch and cex. The first of these should be set to 19 and the last to 2.

plot(x,y,col=kmObj$cluster,pch=19,cex=2)

Now add the centroids which are stored in kmObj$centers. Use the points function with 5 arguments. The first two are kmObj$centers and col=c("black","red","green"). The last three, pch, cex, and lwd, should all equal 3.
qownnotes-media-Y14404


Now for some fun! We want to show you how the output of the kmeans function is affected by its random start (when you just ask for a number of clusters). With random starts you might want to run the function several times to get an idea of the relationships between your observations. We’ll call kmeans with the same data points (stored in dataFrame), but ask for 6 clusters instead of 3.

We’ll plot our data points several times and each time we’ll just change the argument col which will show us how the R function kmeans is clustering them. So, call plot now with 5 arguments. The first 2 are x and y. The third is col set equal to the call kmeans(dataFrame,6)$cluster. The last two (pch and cex) are set to 19 and 2 respectively.

plot(x,y,col=kmeans(dataFrame,6)$cluster,pch=19,cex=2)

qownnotes-media-g14404


See how the points cluster? Now recall your last command and rerun it.

plot(x,y,col=kmeans(dataFrame,6)$cluster,pch=19,cex=2)

qownnotes-media-c14404


plot(x,y,col=kmeans(dataFrame,6)$cluster,pch=19,cex=2)

qownnotes-media-p14404


So the clustering changes with different starts. Perhaps 6 is too many clusters? Let’s review!

DIMENSION REDUCTION PCA/SVD

In this lesson we’ll discuss principal component analysis (PCA) and singular value decomposition (SVD), two important and related techniques of dimension reduction. Both entail finding subsets of variables in datasets that contain their essence. PCA and SVD are used in both the exploratory phase and the more formal modelling stage of analysis. We’ll focus on the exploratory phase and briefly touch on some of the underlying theory.

We’ll begin with a motivating example - random data.

qownnotes-media-j14500


This is dataMatrix, a matrix of 400 random normal numbers (mean 0 and standard deviation 1). We’re displaying it with the R command image. Run the R command head with dataMatrix as its argument to see what dataMatrix looks like.
qownnotes-media-r14500


heatmap(dataMatrix)

qownnotes-media-p14500


We can see that even with the clustering that heatmap provides, permuting the rows (observations) and columns (variables) independently, the data still looks random.
Let’s add a pattern to the data. We’ve put some R code in the file addPatt.R for you. Run the command myedit with the single argument “addPatt.R” (make sure to use the quotation marks) to see the code. You might have to click your cursor in the console after you do this to keep from accidentally changing the file.

set.seed(678910)

for(i in 1:40){
  # flip a coin
  coinFlip <- rbinom(1, size = 1, prob = 0.5)
  # if the coin is heads, add a common pattern to that row:
  # 0 to the first 5 columns and 3 to the last 5
  if(coinFlip){
    dataMatrix[i,] <- dataMatrix[i,] + rep(c(0,3), each = 5)
  }
}

So in rows affected by the coin flip, the 5 left columns will still have a mean of 0 but the right 5 columns will have a mean closer to 3. Here’s the image of the altered dataMatrix after the pattern has been added. The pattern is clearly visible in the columns of the matrix. The right half is yellower or hotter, indicating higher values in the matrix.
qownnotes-media-A14500


Now run the R command heatmap again with dataMatrix as its only argument. This will perform a hierarchical cluster analysis on the matrix.

heatmap(dataMatrix)

qownnotes-media-X14500


Again we see the pattern in the columns of the matrix. As shown in the dendrogram at the top of the display, these split into 2 clusters, the lower numbered columns (1 through 5) and the higher numbered ones (6 through 10). Recall from the code in addPatt.R that for rows selected by the coinflip the last 5 columns had 3 added to them. The rows still look random.
Now consider this picture. On the left is an image similar to the heatmap of dataMatrix you just plotted. It is an image plot of the output of hclust(), a hierarchical clustering function applied to dataMatrix. Yellow indicates “hotter” or higher values than red. This is consistent with the pattern we applied to the data (increasing the values for some of the rightmost columns).
The middle display shows the mean of each of the 40 rows (along the x-axis). The rows are shown in the same order as the rows of the heat matrix on the left. The rightmost display shows the mean of each of the 10 columns. Here the column numbers are along the x-axis and their means along the y.
qownnotes-media-Z14500


We see immediately the connection between the yellow (hotter) portion of the cluster image and the higher row means, both in the upper right portion of the displays. Similarly, the higher valued column means are in the right half of that display and lower column means are in the left half.

Now we’ll talk a little theory. Suppose you have thousands of multivariate variables X_1, …, X_n. By multivariate we mean that each X_i contains many components, i.e., X_i = (X_{i1}, …, X_{im}). However, these variables (observations) and their components might be correlated to one another.
Which of the following would be an example of variables correlated to one another?

1: Heights and weights of members of a family
2: The depth of the Atlantic Ocean and what you eat for breakfast
3: Today’s weather and a butterfly’s wing position

Selection: 1

You are amazing!

|========================= | 23%

As data scientists, we’d like to find a smaller set of multivariate variables that are uncorrelated AND explain as much variance (or variability) of the data as possible. This is a statistical approach.
In other words, we’d like to find the best matrix created with fewer variables (that is, a lower rank matrix) that explains the original data. This is related to data compression.
Two related solutions to these problems are PCA which stands for Principal Component Analysis and SVD, Singular Value Decomposition. This latter simply means that we express a matrix X of observations (rows) and variables (columns) as the product of 3 other matrices, i.e., X=UDV^t. This last term (V^t) represents the transpose of the matrix V.
Here U and V each have orthogonal (uncorrelated) columns. U’s columns are the left singular vectors of X and V’s columns are the right singular vectors of X. D is a diagonal matrix, by which we mean that all of its entries not on the diagonal are 0. The diagonal entries of D are the singular values of X.
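A quick sketch verifying the decomposition on a small matrix (any example matrix will do; this one is chosen arbitrarily):

```r
# a 2 x 3 example matrix
X <- matrix(c(2, 1, 1, 3, 2, 1), nrow = 2)
s <- svd(X)
# rebuild X from its three factors; diag() turns the vector d of
# singular values into the diagonal matrix D
Xhat <- s$u %*% diag(s$d) %*% t(s$v)
max(abs(X - Xhat))         # essentially zero: X = U D V^t
# the columns of U are orthonormal: t(U) %*% U is the identity
round(t(s$u) %*% s$u, 10)
```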
To illustrate this idea we created a simple example matrix called mat. Look at it now.
qownnotes-media-l14500

qownnotes-media-g14500

qownnotes-media-v14500

Now we’ll talk a little about PCA, Principal Component Analysis, “a simple, non-parametric method for extracting relevant information from confusing data sets.” We’re quoting here from a very nice concise paper on this subject which can be found at http://arxiv.org/pdf/1404.1100.pdf. The paper, by Jonathon Shlens of Google Research, is called A Tutorial on Principal Component Analysis.

|===================================== | 34%

Basically, PCA is a method to reduce a high-dimensional data set to its essential elements (not lose information) and explain the variability in the data. We won’t go into the mathematical details here, (R has a function to perform PCA), but you should know that SVD and PCA are closely related.
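One way to see the connection, sketched on random data: prcomp (R's PCA function) returns principal directions that match the right singular vectors of the scaled matrix, up to sign.

```r
set.seed(42)
m  <- matrix(rnorm(40), nrow = 8)
pc <- prcomp(m, scale. = TRUE)   # PCA on centered and scaled data
sv <- svd(scale(m))              # SVD of the same scaled matrix
# first principal direction vs first right singular vector;
# they agree up to a possible sign flip
max(abs(abs(pc$rotation[, 1]) - abs(sv$v[, 1])))
```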

|====================================== | 35%

We’ll demonstrate this now. First we have to scale mat, our simple example data matrix. This means that we subtract the column mean from every element and divide the result by the column standard deviation. Of course R has a command, scale, that does this for you. Run svd on scale of mat.
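What scale does, sketched by hand on a toy matrix:

```r
m <- matrix(c(1, 2, 3, 10, 20, 30), nrow = 3)
# subtract each column mean, then divide by each column standard deviation
manual <- sweep(sweep(m, 2, colMeans(m), "-"), 2, apply(m, 2, sd), "/")
max(abs(scale(m) - manual))   # essentially zero
```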
qownnotes-media-C14500

qownnotes-media-a14500

qownnotes-media-N14500

To prove we’re not making this up, we’ve run svd on dataMatrix and stored the result in the object svd1. This has 3 components, d, u and v. Look at the first column of V now. It can be viewed by using the svd1$v[,1] notation.
qownnotes-media-Z14500


Here we again show the clustered data matrix on the left. Next to it we’ve plotted the first column of the U matrix associated with the scaled data matrix. This is the first LEFT singular vector and it’s associated with the ROW means of the clustered data. You can see the clear separation between the top 24 (around -0.2) row means and the bottom 16 (around 0.2). We don’t show them but note that the other columns of U don’t show this pattern so clearly.
qownnotes-media-c14500


The rightmost display shows the first column of the V matrix associated with the scaled and clustered data matrix. This is the first RIGHT singular vector and it’s associated with the COLUMN means of the clustered data. You can see the clear separation between the left 5 column means (between -0.1 and 0.1) and the right 5 column means (all below -0.4). As with the left singular vectors, the other columns of V don’t show this pattern as clearly as this first one does.
So the singular value decomposition automatically picked up these patterns, the differences in the row and column means.
Why were the first columns of both the U and V matrices so special? Well as it happens, the D matrix of the SVD explains this phenomenon. It is an aspect of SVD called variance explained. Recall that D is the diagonal matrix sandwiched in between U and V^t in the SVD representation of the data matrix. The diagonal entries of D are like weights for the U and V columns accounting for the variation in the data. They’re given in decreasing order from highest to lowest. Look at these diagonal entries now. Recall that they’re stored in svd1$d.
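The proportion of variance each component explains is its squared singular value over the sum of squares. A sketch, using a random matrix as a stand-in for the lesson's svd1:

```r
set.seed(678910)
dataMat <- matrix(rnorm(400), nrow = 40)
s <- svd(scale(dataMat))
# squared singular values as a fraction of their total:
# the share of variance each component accounts for
varExplained <- s$d^2 / sum(s$d^2)
round(varExplained, 3)
sum(varExplained)   # the shares sum to 1
```

Note that svd returns d in decreasing order, so the first shares are the largest.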
qownnotes-media-a14500

qownnotes-media-b14500

qownnotes-media-D14500

qownnotes-media-v14500

qownnotes-media-y14500

qownnotes-media-M14500

qownnotes-media-x14500

To see this more closely, look at the first 2 columns of the v component. We stored the SVD output in the svd object svd2.
qownnotes-media-B14500

qownnotes-media-m14500

qownnotes-media-H14500

So the first element which showed the difference between the left and right halves of the matrix accounts for roughly 50% of the variation in the matrix, and the second element which picked up the alternating pattern accounts for 18% of the variance. The remaining elements account for smaller percentages of the variation. This indicates that the first pattern is much stronger than the second. Also the two patterns confound each other so they’re harder to separate and see clearly. This is what often happens with real data.
Now you’re probably convinced that SVD and PCA are pretty cool and useful as tools for analysis, but one problem with them that you should be aware of, is that they cannot deal with MISSING data. Neither of them will work if any data in the matrix is missing. (You’ll get error messages from R in red if you try.) Missing data is not unusual, so luckily we have ways to work around this problem. One we’ll just mention is called imputing the data.
This uses the k nearest neighbors to calculate a value to use in place of the missing data. You may want to specify an integer k which indicates how many neighbors you want to average to create this replacement value. Bioconductor (http://bioconductor.org) has an impute package which you can use to fill in missing data. One specific function in it is impute.knn.
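The Bioconductor function does the real work; as a rough hand-rolled illustration of the idea only (a toy function, not impute.knn itself):

```r
# toy k-nearest-neighbour imputation: fill each NA with the mean of that
# column over the k complete rows closest to the incomplete row
kImpute <- function(m, k = 2){
  complete <- which(apply(m, 1, function(r) !any(is.na(r))))
  for(i in seq_len(nrow(m))){
    miss <- which(is.na(m[i, ]))
    if(length(miss) == 0) next
    obs <- m[i, -miss]
    # Euclidean distance to every complete row, over the observed columns only
    d <- apply(m[complete, -miss, drop = FALSE], 1,
               function(r) sqrt(sum((r - obs)^2)))
    nn <- complete[order(d)[seq_len(k)]]
    m[i, miss] <- colMeans(m[nn, miss, drop = FALSE])
  }
  m
}
m <- matrix(c(1,   2,   NA,
              1.1, 2.1, 3.1,
              5,   6,   7,
              1,   2,   3), nrow = 4, byrow = TRUE)
kImpute(m)   # the NA becomes the mean of column 3 in the 2 nearest rows
```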
We’ll move on now to a final example of the power of singular value decomposition and principal component analysis and how they work as a data compression technique.
Consider this low resolution image file showing a face. We’ll use SVD and see how the first several components contain most of the information in the file so that storing a huge matrix might not be necessary.
qownnotes-media-S14500


The image data is stored in the matrix faceData. Run the R command dim on faceData to see how big it is.

dim(faceData)
[1] 32 32

You are amazing!

|================================================================================== | 75%

So it’s not that big of a file but we want to show you how to use what you learned in this lesson. We’ve done the SVD and stored it in the object svd1 for you. Here’s the plot of the variance explained.
qownnotes-media-X14500


Recall that the data matrix X is the product of 3 matrices, that is X=UDV^t. These are precisely what you get when you run svd on the matrix X.

|======================================================================================= | 80%

Suppose we create the product of pieces of these, say the first columns of U and V and the first element of D. The first column of U can be interpreted as a 32 by 1 matrix (recall that faceData was a 32 by 32 matrix), so we can multiply it by the first element of D, a 1 by 1 matrix, and get a 32 by 1 matrix result. We can multiply that by the transpose of the first column of V, which is the first principal component. (We have to use the transpose of V’s column to make it a 1 by 32 matrix in order to do the matrix multiplication properly.)

|========================================================================================= | 81%

Alas, that is how we do it in theory, but in R using only one element of d means it’s a constant. So we have to do the matrix multiplication with the %*% operator and the multiplication by the constant (svd1$d[1]) with the regular multiplication operator *.

|========================================================================================== | 82%

Try this now and put the result in the variable a1. Recall that svd1$u, svd1$d, and svd1$v contain all the information you need. NOTE that because of the peculiarities of R’s casting, if you do the scalar multiplication with the * operator first (before the matrix multiplication with the %*% operator) you MUST enclose the 2 arguments (svd1$u[,1] and svd1$d[1]) in parentheses.

a1 <- svd1$u[,1] %*% t(svd1$v[,1]) * svd1$d[1]

Now to look at it as an image. We wrote a function for you called myImage which takes a single argument, a matrix of data to display using the R function image. Run it now with a1 as its argument.

myImage(a1)

qownnotes-media-G14500


It might not look like much but it’s a good start. Now we’ll try the same experiment but this time we’ll use 2 elements from each of the 3 SVD terms.
Create the matrix a2 as the product of the first 2 columns of svd1$u, a diagonal matrix using the first 2 elements of svd1$d, and the transpose of the first 2 columns of svd1$v. Since all of your multiplicands are matrices you have to use only the operator %*% AND you DON’T need parentheses. Also, you must use the R function diag with svd1$d[1:2] as its sole argument to create the proper diagonal matrix. Remember, matrix multiplication is NOT commutative so you have to put the multiplicands in the correct order. Please use the 1:2 notation and not the c(m:n), i.e., the concatenate function, when specifying the columns.

a2 <- svd1$u[,1:2] %*% diag(svd1$d[1:2]) %*% t(svd1$v[,1:2])

myImage(a2)

qownnotes-media-K14500


We’re starting to see slightly more detail, and maybe if you squint you see a grimacing mouth. Now let’s see what image results using 5 components. From our plot of the variance explained 5 components covered a sizeable percentage of the variation. To save typing, use the up arrow to recall the command which created a2 and replace the a2 and assignment arrow with the call to myImage, and change the three occurrences of 2 to 5.

myImage(svd1$u[,1:5] %*% diag(svd1$d[1:5]) %*% t(svd1$v[,1:5]))

qownnotes-media-h14500


Certainly much better. Clearly a face is appearing with eyes, nose, ears, and mouth recognizable. Again, use the up arrow to recall the last command (calling myImage with a matrix product argument) and change the 5’s to 10’s. We’ll see how this image looks.

myImage(svd1$u[,1:10] %*% diag(svd1$d[1:10]) %*% t(svd1$v[,1:10]))

qownnotes-media-O14500


Now that’s pretty close to the original which was low resolution to begin with, but you can see that 10 components really do capture the essence of the image. Singular value decomposition is a good way to approximate data without having to store a lot.

We’ll close now with a few comments. First, when reducing dimensions you have to pay attention to the scales on which different variables are measured and make sure that all your data is in consistent units. In other words, scales of your data matter. Second, principal components and singular values may mix real patterns, as we saw in our simple 2-pattern example, so finding and separating out the real patterns require some detective work. Let’s do a quick review now.
A matrix X has the singular value decomposition UDV^t. The principal components of X are ?
qownnotes-media-D14500

qownnotes-media-b14500

CLUSTERING

In this lesson we’ll apply some of the analytic techniques we learned in this course to data from the University of California, Irvine. Specifically, the data we’ll use is from UCI’s Center for Machine Learning and Intelligent Systems. You can find out more about the data at http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones. As this address indicates, the data involves smartphones and recognizing human activity. Cool, right?
Our goal is to show you how to use exploratory data analysis to point you in fruitful directions of research, that is, towards answerable questions. Exploratory data analysis is a “rough cut” or filter which helps you to find the most beneficial areas of questioning so you can set your priorities accordingly.
We’ve loaded data from this study for you in a matrix called ssd. Run the R command dim now to see its dimensions.

dim(ssd)
[1] 7352  563

You got it right!

|======= | 8%

Wow - ssd is pretty big, 7352 observations, each of 563 variables. Don’t worry we’ll only use a small portion of this “Human Activity Recognition database”.
These last 2 columns contain subject and activity information. We saw above that the gathered data had “been randomly partitioned into two sets, where 70% of the volunteers was selected for generating the training data and 30% the test data.” Run the R command table with ssd$subject as its argument to see if the data in ssd contains training or test data.

CASE STUDY 2

In this lesson we’ll apply some of the techniques we learned in this course to study air pollution data, specifically particulate matter (we’ll call it pm25 sometimes), collected by the U.S. Environmental Protection Agency. This website https://www.health.ny.gov/environmental/indoors/air/pmq_a.htm from New York State offers some basic information on this topic if you’re interested.

|== | 2%

Particulate matter (less than 2.5 microns in diameter) is a fancy name for dust, and breathing in dust might pose health hazards to the population. We’ll study data from two years, 1999 (when monitoring of particulate matter started) and 2012. Our goal is to see if there’s been a noticeable decline in this type of air pollution between these two years.

|=== | 3%

We’ve read in 2 large zipped files for you using the R command read.table (which is smart enough to unzip the files). We stored the 1999 data in the array pm0 for you. Run the R command dim now to see its dimensions.

dim(pm0)
[1] 117421      5

Keep working like that and you’ll get there!

|==== | 4%

We see that pm0 has over 117000 lines, each containing 5 columns. In the original file, at the EPA website, each row had 28 columns, but since we’ll be using only a few of these, we’ve created and read in a somewhat smaller file. Run head on pm0 now to see what the first few lines look like.

head(pm0)
  V1 V2 V3       V4     V5
1  1 27  1 19990103     NA
2  1 27  1 19990106     NA
3  1 27  1 19990109     NA
4  1 27  1 19990112  8.841
5  1 27  1 19990115 14.920
6  1 27  1 19990118  3.878

That’s the answer I was looking for.

|===== | 5%

We see there’s some missing data, but we won’t worry about that now. We also see that the column names, V1, V2, etc., are not informative. However, we know that the first line of the original file (a comment) explained what information the columns contained.

|====== | 6%

We created the variable cnames containing the 28 column names of the original file. Take a look at the column names now.

cnames$names
Error in cnames$names : $ operator is invalid for atomic vectors
cnames
[1] "# RD|Action Code|State Code|County Code|Site ID|Parameter|POC|Sample Duration|Unit|Method|Date|Start Time|Sample Value|Null Data Code|Sampling Frequency|Monitor Protocol (MP) ID|Qualifier - 1|Qualifier - 2|Qualifier - 3|Qualifier - 4|Qualifier - 5|Qualifier - 6|Qualifier - 7|Qualifier - 8|Qualifier - 9|Qualifier - 10|Alternate Method Detectable Limit|Uncertainty"

That’s a job well done!

|====== | 7%

We see that the 28 column names look all jumbled together even though they’re separated by “|” characters, so let’s fix this. Reassign to cnames the output of a call to strsplit (string split) with 3 arguments. The first is cnames, the pipe symbol ‘|’ is the second (use the quotation marks), and the third is the argument fixed set to TRUE. Try this now.

cnames <- strsplit(cnames,"|",fixed = TRUE)

You are amazing!

|======= | 8%

The variable cnames now holds a list of the column headings. Take another look at the column names.

cnames
[[1]]
 [1] "# RD"                              "Action Code"
 [3] "State Code"                        "County Code"
 [5] "Site ID"                           "Parameter"
 [7] "POC"                               "Sample Duration"
 [9] "Unit"                              "Method"
[11] "Date"                              "Start Time"
[13] "Sample Value"                      "Null Data Code"
[15] "Sampling Frequency"                "Monitor Protocol (MP) ID"
[17] "Qualifier - 1"                     "Qualifier - 2"
[19] "Qualifier - 3"                     "Qualifier - 4"
[21] "Qualifier - 5"                     "Qualifier - 6"
[23] "Qualifier - 7"                     "Qualifier - 8"
[25] "Qualifier - 9"                     "Qualifier - 10"
[27] "Alternate Method Detectable Limit" "Uncertainty"

Great job!

|======== | 9%

Nice, but we don’t need all these. Assign to names(pm0) the output of a call to the function make.names with cnames[[1]][wcol] as the argument. The variable wcol holds the indices of the 5 columns we selected (from the 28) to use in this lesson, so those are the column names we’ll need. As the name suggests, the function “makes syntactically valid names”.

names(pm0) <- make.names(cnames[[1]][wcol])

All that hard work is paying off!

|========= | 10%

Now re-run head on pm0 now to see if the column names have been put in place.

head(pm0)
  State.Code County.Code Site.ID     Date Sample.Value
1          1          27       1 19990103           NA
2          1          27       1 19990106           NA
3          1          27       1 19990109           NA
4          1          27       1 19990112        8.841
5          1          27       1 19990115       14.920
6          1          27       1 19990118        3.878

Excellent work!

|========== | 11%

Now it’s clearer what information each column of pm0 holds. The measurements of particulate matter (pm25) are in the column named Sample.Value. Assign this component of pm0 to the variable x0. Use the m$n notation.

pm0$x0 <- pm0$Sample.Value

You almost had it, but not quite. Try again. Or, type info() for more options.
Type x0 <- pm0$Sample.Value at the command prompt.

px0 <- pm0$Sample.Value

One more time. You can do it! Or, type info() for more options.
Type x0 <- pm0$Sample.Value at the command prompt.

x0 <- pm0$Sample.Value

All that practice is paying off!

|=========== | 12%

Call the R command str with x0 as its argument to see x0’s structure.

str(x0)
 num [1:117421] NA NA NA 8.84 14.92 ...

You got it right!

|============ | 13%

We see that x0 is a numeric vector (of length 117000+) with at least the first 3 values missing. Exactly what percentage of values are missing in this vector? Use the R function mean with is.na(x0) as an argument to see what percentage of values are missing (NA) in x0.

mean(is.na(x0))
[1] 0.1125608

That’s correct!

|============= | 14%

So a little over 11% of the 117000+ are missing. We’ll keep that in mind. Now let’s start processing the 2012 data which we stored for you in the array pm1.

|============== | 15%

We’ll repeat what we did for pm0, except a little more efficiently. First assign the output of make.names(cnames[[1]][wcol]) to names(pm1).

names(pm1) <- make.names(cnames[[1]][wcol])

Great job!

|=============== | 16%

Find the dimensions of pm1 with the command dim.

dim(pm1)
[1] 1304287       5

Perseverance, that’s the answer.

|================ | 18%

Wow! Over 1.3 million entries. Particulate matter was first collected in 1999 so perhaps there weren’t as many sensors collecting data then as in 2012 when the program was more mature. If you ran head on pm1 you’d see that it looks just like pm0. We’ll move on though.

|================= | 19%

Create the variable x1 by assigning to it the Sample.Value component of pm1.

x1 <- pm1$Sample.Value

Nice work!

|================= | 20%

Now let’s see what percentage of values are missing in x1. As before, use the R function mean with is.na(x1) as an argument to find out.

mean(is.na(x1))
[1] 0.05607125

That’s a job well done!

|================== | 21%

So only 5.6% of the particulate matter measurements are missing. That’s about half the percentage as in 1999.

|=================== | 22%

Now let’s look at summaries (using the summary command) for both datasets. First, x0.

summary(x0)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    7.20   11.50   13.74   17.90  157.10   13217 

Nice work!

|==================== | 23%

The numbers in the vectors x0 and x1 represent measurements taken in micrograms per cubic meter. Now look at the summary of x1.

summary(x1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 -10.00    4.00    7.63    9.14   12.00  909.00   73133 

You’re the best!

|===================== | 24%

We see that both the median and the mean of measured particulate matter have declined from 1999 to 2012. In fact, all of the measurements, except for the maximum and missing values (Max and NA's), have decreased. Even the Min has gone down from 0 to -10.00! We'll address what a negative measurement might mean a little later. Note that the Max has increased from 157 in 1999 to 909 in 2012. This is quite high and might reflect an error in the table or malfunctions in some monitors.

|====================== | 25%

Call the boxplot function with 2 arguments, x0 and x1.
qownnotes-media-V12236


Huh? Did somebody step on the boxes? It’s hard to see what’s going on here. There are so many values outside the boxes and the range of x1 is so big that the boxes are flattened. It might be more informative to call boxplot on the logs (base 10) of x0 and x1. Do this now using log10(x0) and log10(x1) as the 2 arguments.

boxplot(log10(x0),log10(x1))

qownnotes-media-h12236


Warning messages:
1: In boxplot.default(log10(x0), log10(x1)) : NaNs produced
2: In bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group == : Outlier (-Inf) in boxplot 1 is not drawn
3: In bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group == : Outlier (-Inf) in boxplot 2 is not drawn

That’s the answer I was looking for.

|======================== | 27%

A bonus! Not only do we get a better looking boxplot, we also get some warnings from R in red. These let us know that some values in x0 and x1 were “unloggable”, no doubt the 0 (Min) we saw in the summary of x0 and the negative values we saw in the Min of the summary of x1.
Let’s return to the question of the negative values in x1. Let’s count how many negative values there are. We’ll do this in a few steps.

|=========================== | 30%

First, form the vector negative by assigning to it the boolean x1<0.

negative <- x1<0

You’re the best!

|============================ | 31%

Now run the R command sum with 2 arguments. The first is negative, and the second is na.rm set equal to TRUE. This tells sum to ignore the missing values in negative.

sum(negative, na.rm = TRUE)
[1] 26474

That’s the answer I was looking for.

|============================ | 32%

So there are over 26000 negative values. Sounds like a lot. Is it? Run the R command mean with the same 2 arguments you just used with the call to sum. This will tell us a percentage.

mean(negative, na.rm = TRUE)
[1] 0.0215034

Excellent job!

|============================= | 33%

We see that just 2% of the x1 values are negative. Perhaps that’s a small enough percentage that we can ignore them. Before we ignore them, though, let’s see if they occur during certain times of the year.

|============================== | 34%

First create the array dates by assigning to it the Date component of pm1. Remember to use the x$y notation.

dates <- pm1$Date

All that hard work is paying off!

|=============================== | 35%

To see what dates looks like run the R command str on it.

str(dates)
 int [1:1304287] 20120101 20120104 20120107 20120110 20120113 20120116 20120119 20120122 20120125 20120128 ...

Nice work!

|================================ | 36%

We see dates is a very long vector of integers. However, the format of the entries is hard to read. There’s no separation between the year, month, and day. Reassign to dates the output of a call to as.Date with the 2 arguments as.character(dates) as the first argument and the string “%Y%m%d” as the second.

dates <- as.Date(as.character())

One more time. You can do it! Or, type info() for more options.
Type dates <- as.Date(as.character(dates), "%Y%m%d") at the command prompt.

dates <- as.Date(as.character(dates), "%Y%m%d")

You got it right!

|================================= | 37%

Now when you run head on dates you’ll see the dates in a nicer format. Try this now.

head(dates)
[1] "2012-01-01" "2012-01-04" "2012-01-07" "2012-01-10" "2012-01-13" "2012-01-16"

That’s the answer I was looking for.

|================================== | 38%

Let’s plot a histogram of the months when the particulate matter measurements are negative. Run hist with 2 arguments. The first is dates[negative] and the second is the string “month”.

hist(dates[negative], "month")

qownnotes-media-G12236


We see the bulk of the negative measurements were taken in the winter months, with a spike in May. Not many of these negative measurements occurred in summer months. We can guess that, because particulate measures tend to be low in winter and high in summer, and because higher densities are easier to measure, the measurement errors occurred when the values were low. For now we'll attribute these negative measurements to errors. Also, since they account for only 2% of the 2012 data, we'll ignore them.

|==================================== | 40%

Now we’ll change focus a bit and instead of looking at all the monitors throughout the country and the data they recorded, we’ll try to find one monitor that was taking measurements in both 1999 and 2012. This will allow us to control for different geographical and environmental variables that might have affected air quality in different areas. We’ll narrow our search and look just at monitors in New York State.

|===================================== | 41%

We subsetted off the New York State monitor identification data for 1999 and 2012 into 2 vectors, site0 and site1. Look at the structure of site0 now with the R command str.

str(site0)
 chr [1:33] "1.5" "1.12" "5.73" "5.80" "5.83" "5.110" "13.11" "27.1004" ...

That’s the answer I was looking for.

|====================================== | 42%

We see that site0 (the IDs of monitors in New York State in 1999) is a vector of 33 strings, each of which has the form “x.y”. We’ve created these from the county codes (the x portion of the string) and the monitor IDs (the y portion). If you ran str on site1 you’d see 18 similar values.

|======================================= | 43%

Use the intersect command with site0 and site1 as arguments and put the result in the variable both.

intersect(site0, site1)
 [1] "1.5"     "1.12"    "5.80"    "13.11"   "29.5"    "31.3"    "63.2008" "67.1015" "85.55"
[10] "101.3"

Not quite, but you’re learning! Try again. Or, type info() for more options.
Type both <- intersect(site0, site1) at the command prompt.

both <- intersect(site0,site1)

Nice work!

|======================================= | 44%

Take a look at both now.

both
 [1] "1.5"     "1.12"    "5.80"    "13.11"   "29.5"    "31.3"    "63.2008" "67.1015" "85.55"
[10] "101.3"

All that hard work is paying off!

|======================================== | 45%

We see that 10 monitors in New York State were active in both 1999 and 2012.

|========================================= | 46%

To save you some time and typing, we modified the data frames pm0 and pm1 slightly by adding to each of them a new component, county.site. This is just a concatenation of two original components County.Code and Site.ID. We did this to facilitate the next step which is to find out how many measurements were taken by the 10 New York monitors working in both of the years of interest. Run head on pm0 to see the first few entries now.

head(pm0)
  State.Code County.Code Site.ID     Date Sample.Value     x0 county.site
1          1          27       1 19990103           NA     NA        27.1
2          1          27       1 19990106           NA     NA        27.1
3          1          27       1 19990109           NA     NA        27.1
4          1          27       1 19990112        8.841  8.841        27.1
5          1          27       1 19990115       14.920 14.920        27.1
6          1          27       1 19990118        3.878  3.878        27.1

All that hard work is paying off!

|========================================== | 47%

Now pm0 and pm1 have 6 columns instead of 5, and the last column is a concatenation of two other columns, County.Code and Site.ID.

|=========================================== | 48%

Now let’s see how many measurements each of the 10 New York monitors that were active in both 1999 and 2012 took in those years. We’ll create 2 subsets (one for each year), one of pm0 and the other of pm1.

|============================================ | 49%

The subsets will filter for 2 characteristics. The first is State.Code equal to 36 (the code for New York), and the second is that the county.site (the component we added) is in the vector both.

|============================================= | 51%

First create the variable cnt0 by assigning to it the output of the R command subset, called with 2 arguments. The first is pm0, and the second is a boolean with the 2 conditions we just mentioned. Recall that the testing for equality in a boolean requires ==, intersection of 2 boolean conditions is denoted by & and membership by %in%.

cnt0 <- subset(pm0, State.code == 36 & Country.site %in% both)
Error in eval(expr, envir, enclos) : object 'State.code' not found
cnt0 <- subset(pm0, State.code == 36 & Country.site %in% both)
Error in eval(expr, envir, enclos) : object 'State.code' not found
cnt0 <- subset(pm0, State.Code == 36 & Country.site %in% both)
Error in match(x, table, nomatch = 0L) : object 'Country.site' not found
cnt0 <- subset(pm0, State.Code == 36 & County.site %in% both)
Error in match(x, table, nomatch = 0L) : object 'County.site' not found
cnt0 <- subset(pm0, State.Code == 36 & county.site %in% both)

Excellent job!

|============================================== | 52%

Recall the last command with the up arrow, and create cnt1 (instead of cnt0). Remember to change pm0 to pm1. Everything else can stay the same.

cnt1 <- subset(pm1,State.Code==36&county.site %in% both)

Excellent job!

|=============================================== | 53%

Now run the command sapply(split(cnt0, cnt0$county.site), nrow). This will split cnt0 into several data frames according to county.site (that is, monitor IDs) and tell us how many measurements each monitor recorded.

dim(cnt0)
[1] 952   7

Not quite, but you’re learning! Try again. Or, type info() for more options.
Type sapply(split(cnt0, cnt0$county.site), nrow) at the command prompt.

dim(cnt1)
[1] 328   6

Not quite, but you’re learning! Try again. Or, type info() for more options.
Type sapply(split(cnt0, cnt0$county.site), nrow) at the command prompt.

sapply(split(cnt0, cnt0$county.site), nrow)
   1.12     1.5   101.3   13.11    29.5    31.3    5.80 63.2008 67.1015   85.55
     61     122     152      61      61     183      61     122     122       7

That’s correct!

|================================================ | 54%

Do the same for cnt1. (Recall your last command and change 2 occurrences of cnt0 to cnt1.)

sapply(split(cnt1, cnt1$county.site), nrow)
   1.12     1.5   101.3   13.11    29.5    31.3    5.80 63.2008 67.1015   85.55
     31      64      31      31      33      15      31      30      31      31

Great job!

|================================================= | 55%

From the output of the 2 calls to sapply, which monitor is the only one whose number of measurements increased from 1999 to 2012?

1: 85.55
2: 101.3
3: 63.2008
4: 29.5

Selection: 1

You got it right!

|================================================== | 56%

We want to examine a monitor with a reasonable number of measurements so let’s look at the monitor with ID 63.2008. Create a variable pm0sub which is the subset of cnt0 (this contains just New York data) which has County.Code equal to 63 and Site.ID 2008.

pm0sub <- subset(cnt0,County.Code==63&Site.ID==2008)

Keep working like that and you’ll get there!

|================================================== | 57%

Now do the same for cnt1. Name this new variable pm1sub.

pm1sub <- subset(cnt1,County.Code==63&Site.ID==2008)

Excellent work!

|=================================================== | 58%

From the output of the 2 calls to sapply, how many rows will pm0sub have?

1: 30
2: 29
3: 183
4: 122

Selection: 4

You are amazing!

|==================================================== | 59%

Now we’d like to compare the pm25 measurements of this particular monitor (63.2008) for the 2 years. First, create the vector x0sub by assigning to it the Sample.Value component of pm0sub.

x0sub <- pm0sub$Sample.Value

Your dedication is inspiring!

|===================================================== | 60%

Similarly, create x1sub from pm1sub.

x1sub <- pm1sub$Sample.Value

You are quite good my friend!

|====================================================== | 61%

We’d like to make our comparison visually so we’ll have to create a time series of these pm25 measurements. First, create a dates0 variable by assigning to it the output of a call to as.Date. This will take 2 arguments. The first is a call to as.character with pm0sub$Date as the argument. The second is the format string “%Y%m%d”.

dates0 <- as.Date(as.character(pm0sub$Date), "Y%m%d")

You almost had it, but not quite. Try again. Or, type info() for more options.
Type dates0 <- as.Date(as.character(pm0sub$Date), "%Y%m%d") at the command prompt.

dates0 <- as.Date(as.character(pm0sub$Date), "%Y%m%d")

Excellent job!

|======================================================= | 62%

Do the same for the 2012 data. Specifically, create dates1 using pm1sub$Date as your input.

dates1 <- as.Date(as.character(pm1sub$Date), "%Y%m%d")

You are amazing!

|======================================================== | 63%

Now we’ll plot these 2 time series in the same panel using the base plotting system. Call par with 2 arguments. The first is mfrow set equal to c(1,2). This will tell the system we’re plotting 2 graphs in 1 row and 2 columns. The second argument will adjust the panel’s margins. It is mar set to c(4,4,2,1).

par(c(1,2), c(4,4,2,1))
[[1]]
NULL

[[2]]
NULL

You’re close…I can feel it! Try it again. Or, type info() for more options.
Type par(mfrow = c(1, 2), mar = c(4, 4, 2, 1)) at the command prompt.

par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

That’s a job well done!

|========================================================= | 64%

Call plot with the 3 arguments dates0, x0sub, and pch set to 20. The first two arguments are the x and y coordinates. This will show the pm25 values as functions of time.

plot(dates0,x0sub,pch=20)

qownnotes-media-I12236


Now we’ll mark the median.

|=========================================================== | 66%

Use abline to add a horizontal line at the median of the pm25 values. Make the line width 2 (lwd is the argument), and when you call median with x0sub, specify the argument na.rm to be TRUE.

abline(lwd=2,h = median(x0sub))

Not quite, but you’re learning! Try again. Or, type info() for more options.
Type abline(h = median(x0sub, na.rm = TRUE),lwd=2) at the command prompt.

abline(h = median(x0sub, na.rm = TRUE),lwd=2)

You are quite good my friend!

|============================================================ | 67%

Now we’ll do the same for the 2012 data. Call plot with the 3 arguments dates1, x1sub, and pch set to 20.

plot(dates1,x1sub,pch=20)

You got it right!

|============================================================= | 68%

As before, we’ll mark the median of this 2012 data.

|============================================================= | 69%

Use abline to add a horizontal line at the median of the pm25 values. Make the line width 2 (lwd is the argument). Remember to specify the argument na.rm to be TRUE when you call median on x1sub.

abline(h = median(x1sub, na.rm = TRUE),lwd=2)

qownnotes-media-e12236


Nice work!

|============================================================== | 70%

Which median is larger - the one for 1999 or the one for 2012?

1: 2012
2: 1999

Selection: 2

Excellent work!

|=============================================================== | 71%

The picture makes it look like the median is higher for 2012 than 1999. Closer inspection shows that this isn't true. The median for 1999 is a little over 10 micrograms per cubic meter and for 2012 it's a little over 8. The plots appear this way because the 1999 plot ...

1: shows a bigger range of y values than the 2012 plot
2: shows different months than those in the 2012 plot
3: displays more points than the 2012 plot

Selection: 1

You got it right!

|================================================================ | 72%

The 1999 plot shows a much bigger range of pm25 values on the y axis, from below 10 to 40, while the 2012 pm25 values are much more restricted, from around 1 to 14. We should really plot the points of both datasets on the same range of values on the y axis. Create the variable rng by assigning to it the output of a call to the R command range with 3 arguments, x0sub, x1sub, and the boolean na.rm set to TRUE.

rng <- range(x0sub,x1sub,na.rm = TRUE)

You nailed it! Good job!

|================================================================= | 73%

Look at rng to see the values it spans.

rng
[1]  3.0 40.1

Excellent work!

|================================================================== | 74%

Here's a new figure we've created showing the two plots side by side with the same range of values on the y axis. We used the argument ylim set equal to rng in our 2 calls to plot. The improvement in the medians between 1999 and 2012 is now clear. Also notice that in 2012 there are no big values (above 15). This shows that not only is there a chronic improvement in air quality, but also there are fewer days with severe pollution.
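The transcript doesn't show the calls behind that figure; below is a minimal sketch of the side-by-side plot, using hypothetical stand-in vectors in place of the lesson's x0sub, x1sub, dates0, and dates1.

```r
# Hypothetical stand-ins for the lesson's vectors (the real ones come
# from pm0sub$Sample.Value, pm1sub$Sample.Value and the Date columns)
x0sub  <- c(9, 12, 35, 40.1, NA, 8)
x1sub  <- c(4, 9, 3, 7, 11, NA)
dates0 <- as.Date("1999-01-01") + seq(0, 150, by = 30)
dates1 <- as.Date("2012-01-01") + seq(0, 150, by = 30)

# Shared y-axis range so the two panels are directly comparable
rng <- range(x0sub, x1sub, na.rm = TRUE)

par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
plot(dates0, x0sub, pch = 20, ylim = rng, main = "1999")
abline(h = median(x0sub, na.rm = TRUE), lwd = 2)
plot(dates1, x1sub, pch = 20, ylim = rng, main = "2012")
abline(h = median(x1sub, na.rm = TRUE), lwd = 2)
```

The only change from the earlier plots is ylim = rng in both calls, which forces the two panels onto the same scale.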

|=================================================================== | 75%

The last avenue of this data we’ll explore (and we’ll do it quickly) concerns a comparison of all the states’ mean pollution levels. This is important because the states are responsible for implementing the regulations set at the federal level by the EPA.

|==================================================================== | 76%

Let’s first gather the mean (average measurement) for each state in 1999. Recall that the original data for this year was stored in pm0.

|===================================================================== | 77%

Create the vector mn0 with a call to the R command with using 2 arguments. The first is pm0. This is the data in which the second argument, an expression, will be evaluated. The second argument is a call to the function tapply. This call requires 4 arguments. Sample.Value and State.Code are the first two. We want to apply the function mean to Sample.Value, so mean is the third argument. The fourth is simply the boolean na.rm set to TRUE.

mn0 <- with(pm0,tapply(Sample.Value,State.Code,mean,na.rm = TRUE))

Nice work!

|====================================================================== | 78%

Call the function str with mn0 as its argument to see what it looks like.

str(mn0)
 num [1:53(1d)] 19.96 6.67 10.8 15.68 17.66 ...
 - attr(*, "dimnames")=List of 1
  ..$ : chr [1:53] "1" "2" "4" "5" ...

All that hard work is paying off!

|======================================================================= | 79%

We see mn0 is a 53-element numeric vector. Why 53 if there are only 50 states? As it happens, pm25 measurements for the District of Columbia (Washington, D.C.), the Virgin Islands, and Puerto Rico are included in this data. They are coded as 11, 78, and 72 respectively.

|======================================================================== | 80%

Recall your command creating mn0 and change it to create mn1 using pm1 as the first input to the call to with.

mn1 <- with(pm1,tapply(Sample.Value,State.Code,mean,na.rm = TRUE))

All that hard work is paying off!

|======================================================================== | 81%

For fun, call the function str with mn1 as its argument.

str(mn1)
 num [1:52(1d)] 10.13 4.75 8.61 10.56 9.28 ...
 - attr(*, "dimnames")=List of 1
  ..$ : chr [1:52] "1" "2" "4" "5" ...

You got it right!
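mn1 has 52 entries where mn0 had 53, so one state code is missing from 2012. One way to find it is setdiff on the names; here is a minimal sketch with tiny hypothetical stand-in vectors (the made-up values don't matter, only the names do).

```r
# Tiny hypothetical stand-ins for the real mn0/mn1 (values made up);
# names are state codes: 11 = D.C., 72 = Puerto Rico, 78 = Virgin Islands
mn0_demo <- c("11" = 15.8, "72" = 10.2, "78" = 9.0)
mn1_demo <- c("11" = 12.0, "72" = 8.5)

# Codes present in 1999 but absent in 2012
setdiff(names(mn0_demo), names(mn1_demo))   # "78", the Virgin Islands code
```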

|========================================================================= | 82%

So mn1 has only 52 entries, rather than 53. We checked. There are no entries for the Virgin Islands in 2012. Call summary now with mn0 as its input.

summary(mn0)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.862   9.519  12.310  12.410  15.640  19.960

That’s correct!

|========================================================================== | 84%

Now call summary with mn1 as its input so we can compare the two years.

summary(mn1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.006   7.355   8.729   8.759  10.610  11.990

You got it right!

|=========================================================================== | 85%

We see that in all 6 entries, the 2012 numbers are less than those in 1999. Now we’ll create 2 new dataframes containing just the state names and their mean measurements for each year. First, we’ll do this for 1999. Create the data frame d0 by calling the function data.frame with 2 arguments. The first is state set equal to names(mn0), and the second is mean set equal to mn0.

d0 <- data.frame(state=names(mn0),mean=mn0)

Keep up the great work!

|============================================================================ | 86%

Recall the last command and create d1 instead of d0 using the 2012 data. (There’ll be 3 changes of 0 to 1.)

d1 <- data.frame(state=names(mn1),mean=mn1)

You are quite good my friend!

|============================================================================= | 87%

Create the array mrg by calling the R command merge with 3 arguments, d0, d1, and the argument by set equal to the string “state”.

mrg <- merge(d0, d1, by = "state")

You got it!

|============================================================================== | 88%

Run dim with mrg as its argument to see how big it is.

dim(mrg)
[1] 52  3

You are quite good my friend!

|=============================================================================== | 89%

We see mrg has 52 rows and 3 columns. Since the Virgin Islands data was missing from d1, it is excluded from mrg. Look at the first few entries of mrg using the head command.

head(mrg)
  state    mean.x    mean.y
1     1 19.956391 10.126190
2    10 14.492895 11.236059
3    11 15.786507 11.991697
4    12 11.137139  8.239690
5    13 19.943240 11.321364
6    15  4.861821  8.749336

You nailed it! Good job!

|================================================================================ | 90%

Each row of mrg has 3 entries - a state identified by number, a state mean for 1999 (mean.x), and a state mean for 2012 (mean.y).

|================================================================================= | 91%

Now we’ll plot the data to see how the state means changed between the 2 years. First we’ll plot the 1999 data in a single column at x=1. The y values for the points will be the state means. Again, we’ll use the R command with so we don’t have to keep typing mrg as the data environment in which to evaluate the second argument, the call to plot. We’ve already reset the graphical parameters for you.

|================================================================================== | 92%

For the first column of points, call with with 2 arguments. The first is mrg, and the second is the call to plot with 3 arguments. The first of these is rep(1,52). This tells the plot routine that the x coordinates for all 52 points are 1. The second argument is the second column of mrg or mrg[,2] which holds the 1999 data. The third argument is the range of x values we want, namely xlim set to c(.5,2.5). This works since we’ll be plotting 2 columns of points, one at x=1 and the other at x=2.

with(mrg,plot(rep(1,52),mrg[,2],xlim = c(.5,2.5)))

Perseverance, that’s the answer.

|=================================================================================== | 93%

We see a column of points at x=1 which represent the 1999 state means. For the second column of points, again call with with 2 arguments. As before, the first is mrg. The second, however, is a call to the function points with 2 arguments. We need to do this since we’re adding points to an already existing plot. The first argument to points is the set of x values, rep(2,52). The second argument is the set of y values, mrg[,3]. Of course, this is the third column of mrg. (We don’t need to specify the range of x values again.)

with(mrg,points(rep(2,52),mrg[,3]))

You are amazing!

|=================================================================================== | 94%

We see a shorter column of points at x=2. Now let’s connect the dots. Use the R function segments with 4 arguments. The first 2 are the x and y coordinates of the 1999 points and the last 2 are the x and y coordinates of the 2012 points. As in the previous calls specify the x coordinates with calls to rep and the y coordinates with references to the appropriate columns of mrg.

segments(rep(1,52),mrg[,2],rep(2,52),mrg[,3])

qownnotes-media-o10868


For fun, let's see which states had higher means in 2012 than in 1999. Just use the mrg[mrg$mean.x < mrg$mean.y, ] notation to find the rows of mrg with this particulate property.

mrg[mrg$mean.x < mrg$mean.y, ]
   state    mean.x    mean.y
6     15  4.861821  8.749336
23    31  9.167770  9.207489
27    35  6.511285  8.089755
33    40 10.657617 10.849870

All that hard work is paying off!

|====================================================================================== | 97%

Only 4 states had worse pollution averages, and 2 of these had means that were very close. If you want to see which states (15, 31, 35, and 40) these are, you can check out this website https://www.epa.gov/enviro/state-fips-code-listing to decode the state codes.
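For decoding inside R instead of on the website, a small named-vector lookup works; the four names below are the standard FIPS assignments for these codes.

```r
# FIPS state codes for the four states whose means rose from 1999 to 2012
fips <- c("15" = "Hawaii", "31" = "Nebraska",
          "35" = "New Mexico", "40" = "Oklahoma")

# Indexing the named vector by code returns the state names
unname(fips[c("15", "31", "35", "40")])
# [1] "Hawaii" "Nebraska" "New Mexico" "Oklahoma"
```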

|======================================================================================= | 98%

This concludes the lesson, comparing air pollution data from two years in different ways. First, we looked at measures of the entire set of monitors, then we compared the two measures from a particular monitor, and finally, we looked at the mean measures of the individual states.

Miguel Angel Huerta

16 de octubre de 2018